Principles  of  Population  Genetics 


Daniel L. Hartl

Harvard  University 

Andrew G. Clark

Pennsylvania  State  University 


Principles  of 
Population  Genetics 

THIRD  EDITION 


Sinauer Associates, Inc. Publishers
Sunderland, Massachusetts


THE  COVER 

This  image  represents  data  from  the  first  study  of  nucleotide  sequence  variation  in  a 
natural  population,  conducted  by  Martin  Kreitman  (1983;  Nature  304:412-417).  Each 
of  the  11  vertical  bands  represents  the  43  varying  nucleotide  sites  in  one  allele  of  the 
gene  alcohol  dehydrogenase  taken  from  a global  distribution  of  the  fruit  fly  Drosophila 
melanogaster. The colors correspond to the bases at each site: adenine = green,
cytosine = yellow, guanine = blue, and thymine = red. Note that at each position only two
different  bases  were  observed.  Blocks  of  sites  in  strong  linkage  disequilibrium  can 
also  be  seen  as  repeated  patterns  of  color.  The  sequences  are  oriented  with  the  5'  end 
at  the  top,  and  the  nucleotide  corresponding  to  the  fast/slow  difference  in  the  ADH 
protein  is  the  twelfth  one  from  the  bottom. 


PRINCIPLES  OF  POPULATION  GENETICS,  Third  Edition 
Copyright  © 1997  by  Sinauer  Associates,  Inc.  All  rights  reserved. 

This  book  may  not  be  reproduced  in  whole  or  in  part  without  permission 
from  the  publisher. 

For  information  or  to  order,  address: 

Sinauer  Associates,  Inc.,  PO  Box  407 
23  Plumtree  Road,  Sunderland,  MA,  01375  USA 
FAX:  413-549-1118 

Internet:  publish@sinauer.com;  http://www.sinauer.com 

Library  of  Congress  Cataloging-in-Publication  Data 
Hartl,  Daniel  L. 

Principles  of  population  genetics  / Daniel  L.  Hartl,  Andrew  G. 
Clark.  — 3rd  ed. 

p.  cm. 

Includes  bibliographical  references  and  index. 

ISBN  0-87893-306-9  (hardcover) 

1.  Population  genetics.  2.  Quantitative  genetics.  3.  Population 
genetics--Problems, exercises, etc. I. Clark, Andrew G., 1954-
II.  Title. 

QH455.H36  1997 
576.5'8 — dc21 

Printed  in  Canada 


97-34505 

CIP 


To  Barbara  and  Christine 


Table  of  Contents 


PREFACE xi

GENETIC AND STATISTICAL
BACKGROUND 1

Gene  Expression  and  Gene  Interaction  2 

Gene  Expression  3 
The  Genetic  Code  6 
Alleles  8 

Genotype  and  Phenotype  8 
Dominance  and  Gene  Interaction  11 
Segregation  and  Recombination  12 

Probability  in  Population  Genetics  15 

The  Addition  Rule  16 
The  Multiplication  Rule  16 
Repeated Trials 17

Phenotypic  Diversity  and  Genetic 
Variation  20 

Allele  Frequencies  in  Populations  20 
Parameters  and  Estimates  22 
The  Standard  Error  of  an  Estimate  22 

Models  in  Population  Genetics  26 

Exponential  Population  Growth  27 
Logistic  Population  Growth  31 

Summary  33 
Problems  34 

GENETIC  AND  PHENOTYPIC 
VARIATION  37 

Phenotypic  Variation  in  Natural 
Populations  37 

Continuous Variation: The Normal Distribution 38

Mean  and  Variance  39 


Central  Limit  Theorem  41 
Discrete  Mendelian  Variation  43 
Experimental  Methods  for  Detecting 
Genetic  Variation  44 

Protein  Electrophoresis  45 
The  Southern  Blot  Procedure  48 
The  Polymerase  Chain  Reaction  51 
Polymorphism  and  Heterozygosity  53 
Allozyme  Polymorphisms  54 
How Representative Are Allozymes? 56
Polymorphisms  in  DNA  Sequences  57 
Nucleotide  Polymorphism  and  Nucleotide 
Diversity  57 

Uses  of  Genetic  Polymorphisms  62 

Multiple-Factor  Inheritance  64 
Summary  66 
Problems  68 

ORGANIZATION OF GENETIC
VARIATION 71

Random  Mating  72 

Nonoverlapping  Generations  73 
The  Hardy-Weinberg  Principle  74 
Random  Mating  of  Genotypes  versus 
Random  Union  of  Gametes  76 
Implications  of  the  Hardy-Weinberg 
Principle  79 

The  Hardy-Weinberg  Principle  in 
Operation  80 

Complications  of  Dominance  84 
Frequency  of  Heterozygotes  87 

Special  Cases  of  Random  Mating  88 

Three  or  More  Alleles  88 
X-Linked  Genes  92 




Linkage  and  Linkage  Disequilibrium 
95 

Summary  106 

Problems  107 

POPULATION  SUBSTRUCTURE 
111 

Hierarchical  Population  Structure  111 

Reduction  in  Heterozygosity  112 
Average  Heterozygosity  114 
Wright's F Statistics 117
Genetic  Divergence  among  Subpopulations 
120 

Isolate  Breaking:  The  Wahlund  Principle 
122 

Wahlund's Principle and the Fixation Index
125 

Genotype Frequencies in Subdivided Populations 127

Population  Genetics  in  DNA  Typing 
128 

Polymorphisms Based on a Variable Number
of Tandem Repeats (VNTR) 129
Match  Probabilities  with  Hardy-Weinberg 
Equilibrium  and  Linkage  Equilibrium 
132 

Effects  of  Population  Substructure  132 

Inbreeding  135 

Genotype  Frequencies  with  Inbreeding 
135 

Relation  between  the  Inbreeding  Coefficient 
and  the  F Statistics  139 
The  Inbreeding  Coefficient  as  a Probability 
141 

Genetic  Effects  of  Inbreeding  145 
Calculation  of  the  Inbreeding  Coefficient 
from  Pedigrees  149 
Regular  Systems  of  Mating  153 

Assortative  Mating  155 

Summary  158 

Problems  159 


SOURCES OF VARIATION 163

Mutation  163 

Irreversible  Mutation  164 
Reversible  Mutation  168 
Probability  of  Fixation  of  a New  Neutral 
Mutation  170 

The  Infinite-Alleles  Model  174 
Neutral  Mutations  177 

Linkage  and  Recombination  180 

Presumed Evolutionary Benefit of Recombination 181

Recombination  and  Polymorphism  181 
Piecewise  Recombination  in  Bacteria  186 
Absence  of  Recombination  in  Animal 
Mitochondrial  DNA  187 

Migration  189 

One-Way  Migration  189 
The  Island  Model  of  Migration  192 
How  Migration  Limits  Genetic  Divergence 
194 

Estimates  of  Migration  Rates  196 
Patterns  of  Migration  196 
Transposable  Elements  198 

Factors Controlling the Population Dynamics
of Transposable Elements 200
Insertion Sequences and Composite Transposons
in Bacteria 200
Transposable  Elements  in  Eukaryotes  204 
Horizontal  Transmission  of  Transposable 
Elements  204 
Summary  206 
Problems  208 

DARWINIAN SELECTION 211

Selection  in  Haploid  Organisms  212 

Discrete  Generations  212 
Continuous  Time  216 
Change  in  Allele  Frequency  in  Haploids 
217 




Darwinian  Fitness  and  Malthusian  Fitness 
218 

Selection  in  Diploid  Organisms  218 

Change  in  Allele  Frequency  in  Diploids 
219 

Time  Required  for  a Given  Change  in  Allele 
Frequency  222 

Application  to  the  Evolution  of  Insecticide 
Resistance  226 

Equilibria  with  Selection  227 

Overdominance  228 
Local  Stability  232 
Heterozygote  Inferiority  234 
The  Adaptive  Topography  and  the  Role  of 
Random  Genetic  Drift  236 
Mutation-Selection  Balance  236 
Equilibrium  Allele  Frequencies  237 
The Haldane-Muller Principle 239
More  Complex  Types  of  Selection  240 
Frequency-Dependent  Selection  240 
Density-Dependent  Selection  241 
Fecundity  Selection  241 
Age-Structured  Populations  242 
Heterogeneous  Environments  and  Clines 
242 

Diversifying  Selection  244 

Differential  Selection  in  the  Sexes  246 

X-linked  Genes  246 

Gametic  Selection  246 

Meiotic  Drive  247 

Multiple  Alleles  250 

Multiple Loci and Gene Interaction:

Epistasis  252 
Sexual  Selection  255 
Kin  Selection  256 
Interdeme  Selection  and  the  Shifting 
Balance  Theory  259 
Summary  262 
Problems  264 


RANDOM  GENETIC  DRIFT  267 

Random  Genetic  Drift  and  Binomial 
Sampling  267 

The  Wright-Fisher  Model  of  Random 
Genetic  Drift  274 
The  Diffusion  Approximation  277 

Absorption  Time  and  Time  to  Fixation  282 

Parallelism  between  Random  Drift  and 
Inbreeding  283 
Effective  Population  Size  289 

Fluctuation  in  Population  Size  290 
Unequal  Sex  Ratio,  Sex  Chromosomes, 
Organelle  Genes  292 

Balance  between  Mutation  and  Drift 
294 

Infinite  Alleles  Model  294 

The  Ewens  Sampling  Formula  296 
The  Ewens-Watterson  Test  298 

Infinite-Sites  Model  300 

Gene  Trees  and  the  Coalescent  304 

Coalescent  Models  with  Mutation  308 

Summary  310 
Problems  312 


MOLECULAR  POPULATION 
GENETICS  315 


The  Neutral  Theory  and  Molecular 
Evolution  315 

Theoretical  Principles  of  the  Neutral  Theory 
316 

Estimating  Rates  of  Molecular  Sequence 
Divergence  320 

Rates  of  Amino  Acid  Replacement  320 
Rates  of  Nucleotide  Substitution  324 
Other  Measures  of  Molecular  Divergence 
327 


The  Molecular  Clock  328 

Variation  across  Genes  in  the  Rate  of  the 
Molecular  Clock  331 




Variation  across  Lineages  in  Clock  Rate 
333 

The  Generation-Time  Effect  336 
Does  the  Constancy  of  Substitution  Rates 
Prove  the  Neutral  Theory?  337 

Patterns  of  Nucleotide  and  Amino  Acid 
Substitution  338 

Calculating Synonymous and Nonsynonymous
Substitution Rates 338
Within-Species  Polymorphism  345 
Implications  of  Codon  Bias  348 

Polymorphism  and  Divergence  in 

Nucleotide  Sequence  Data  349 

Impact  of  Local  Recombination  Rates  353 
Gene  Genealogies  354 

Hypothesis  Testing  Using  Trees  356 
Inferences  about  Migration  Based  on  Gene 
Trees  360 

Mitochondrial  and  Chloroplast  DNA 
Evolution  361 

Chloroplast  DNA  and  Organelle 
Transmission  in  Plants  365 
Maintenance  of  Variation  in  Organelle 
Genomes  366 

Evidence  for  Selection  in  mtDNA  367 
Molecular  Phylogenetics  368 

Algorithms  for  Phylogenetic  Tree 
Reconstruction  368 
Distance  Methods  versus  Parsimony  372 
Bootstrapping  and  Statistical  Confidence 
in  a Tree  372 
Shared  Polymorphism  373 
Interspecific  Genetics  374 
Multigene  Families  374 

Causes  of  Concerted  Evolution  375 
Multigene  Family  Evolution  through  a 
Birth  and  Death  Process  378 
Structural  RNA  Genes  and  Compensatory 
Substitutions  382 
Multigene  Superfamilies  383 


Dispersed  Highly  Repetitive  DNA 
Sequences  385 

Summary  390 

Problems  392 

QUANTITATIVE GENETICS 397

Types  of  Quantitative  Traits  398 

Resemblance  between  Relatives  and 
the  Concept  of  Heritability  400 

Artificial  Selection  and  Realized 
Heritability  406 

Prediction  Equation  for  Individual 
Selection  407 

Selection  Limits  411 

Genetic  Models  for  Quantitative  Traits 
414 

Change  in  Gene  Frequency  421 

Genetic  Model  for  the  Change  in  Mean 
Phenotype  423 

Components  of  Phenotypic  Variance  424 

Genetic  and  Environmental  Sources  of 
Variation  425 

Components  of  Genotypic  Variation  430 

Covariance  between  Relatives  434 

Twin  Studies  and  Inferences  of  Heritability 
in  Humans  440 

Experimental  Assessment  of  Genetic 
Variance  Components  442 

Indirect  Estimation  of  the  Number  of  Genes 
Affecting  a Quantitative  Character  445 

Norm  of  Reaction  and  Phenotypic 
Plasticity  448 

Threshold  Traits  and  the  Genetics  of 
Liability  452 

Correlated  Response  and  Genetic 
Correlation  454 

Inference  of  Selection  from  Phenotypic 
Data  458 

Evolution  of  Quantitative  Traits  460 




Random  Genetic  Drift  and  Phenotypic 
Evolution  461 

Mutation-Selection  Balance  465 
Quantitative  Trait  Loci  467 

Mapping  Genes  that  Influence 
Quantitative  Characters  467 
Significance  Testing  of  QTLs  470 
Composite  Interval  Mapping  and 
Other  Refinements  471 
What  Have  We  Learned  from  Mapping 
QTLs?  473 

Summary  476 
Problems  479 


SUGGESTIONS  FOR  FURTHER 
READING  483 

ANSWERS  TO  CHAPTER-END 
PROBLEMS  487 

BIBLIOGRAPHY 505

AUTHOR INDEX 521

SUBJECT INDEX 525


Preface 


Thanks in part to the power of molecular methods, population genetics has
been reinvigorated. As some genome projects are approaching closure and
methods of "functional genomics" are scaling up to identify the roles of
novel genes, inevitably increasing attention is being paid to the
significance of genetic variation in populations. Nowhere is this more
evident than in medical genetics. Within a decade we can expect that all
major single-gene inherited disorders will be identified, genetically
mapped, cloned, and characterized at a fine molecular level. Health
professionals realize that this impressive feat will have an impact only on
a small minority of individuals. Most of the genetic variation in disease
risk is multifactorial, which means that the risk is determined by multiple
genetic and environmental factors acting together. Killer diseases such as
familial forms of cancer, diabetes, and cardiovascular disease fall into
this category. The fact that these diseases aggregate in families implies
that there is probably a genetic component, but the genetic component may
differ from one family or ethnic group to another. Prompted by the high
incidence of multifactorial diseases as a group, the medical community has
become acutely aware of the need to understand the basic structure of
genetic variation in populations in order to determine what aspects of the
variation cause disease.

The exciting practical applications of population genetics to the analysis
of multifactorial diseases have received great attention, but the scope of
population genetics actually is much broader. Population genetics provides
the genetic underpinning for all of evolutionary biology. By "evolution" we
mean descent with modification. Species undergo progressive genetic
modification as they adapt to their environments, and new species arise as
a by-product of this process. The intellectual excitement of biological
evolution arises from the fact that it addresses the fundamental questions,
"What are we?" and "Where did we come from?"

Patterns of evolutionary history are recorded in DNA sequences, and the
application of population genetics to interpreting DNA sequences is
revealing many secrets about the evolutionary past, including the history
of our own species. But population genetics embraces much more than the
analysis of evolutionary relationships. It is particularly concerned with
the processes and mechanisms by which evolutionary changes are made. The
field is inherently multidisciplinary, cutting across molecular biology,
genetics, ecology, evolutionary biology, systematics, natural history,
plant breeding, animal breeding, conservation and wildlife management,
human genetics, sociology, anthropology, mathematics, and statistics.

Students taking population genetics are usually expected to have completed,
or to be taking concurrently, a course in differential calculus. While this
book assumes a familiarity with the elementary notation for differentials
and integrals, it does not require great mathematical proficiency. We have
kept the mathematics to a minimum. On the other hand, some of the most
important models in population genetics require quite advanced mathematics.
Rather than ignore these approaches, we have made a concerted effort to
present these models in such a way that the assumptions can be understood
and the main results appreciated without much mathematics. References are
provided for the interested reader to learn more about the details.

Several important changes distinguish the third edition of Principles from
the second edition. The level of the treatment is more tailored to the
needs of a one-semester or one-quarter course, with the intended audience
being third- and fourth-year undergraduates as well as beginning graduate
students. Population genetics is not only an experimental science but also
a theoretical one. Special care has been taken to explain the biological
motivation behind the theoretical models so that the models do not simply
materialize out of thin air, and to explain in plain English the
implications of the results. Many concepts are illustrated by numerical
examples, using actual data wherever possible. Special topics and examples
are often set off from the text as boxed problems whose solutions are
explained step by step. Every chapter ends with about 20 problems, graded
in difficulty, and solutions worked in full appear at the end of the text.

This edition of Principles is organized into nine chapters that gradually
build concepts from measuring variation and the various forces that
influence genetic variation through a sequential progression to concepts of
molecular population genetics and quantitative genetics. The first chapter
provides a background in basic genetic and statistical principles. We
discuss the fundamental concepts of allelism, dominance, segregation,
recombination, and population frequencies. The role of model building and
testing in population genetics is emphasized. Chapter 2 introduces the
student to the primary data of population genetics, namely, the many levels
of genetic variation. Chapter 3 is concerned with the organization of
genetic variation into genotypes in populations. Here the Hardy-Weinberg
principle gets very thorough coverage, including the cases of X-linkage and
multiple alleles. Chapter 4 widens the perspective and considers the
organization of genetic variation among spatially structured populations.
Population substructure is measured by Wright's F statistics, which are
presented in a way that conveys their biological meaning. The Wahlund
principle and inbreeding are also covered in Chapter 4.

The goal of population genetics is to understand the forces that have an
impact on levels of genetic variation. The forces of mutation,
recombination, and migration are outlined in Chapter 5. Darwinian selection
is the topic of Chapter 6, including both the theoretical foundations and
empirical observations of the dynamics of gene-frequency change under the
action of selection. Haploid and diploid cases are developed, as are the
concepts of equilibrium, stability, and context dependence. After classical
models of mutation-selection balance are developed, a series of more
complex scenarios of natural selection is presented.

Chapter 7 deals with random genetic drift. In the absence of other forces,
allele and genotype frequencies change as a result of random sampling from
one generation to another. The Wright-Fisher model and diffusion
approximations are presented in such a way that the student gains an
appreciation for the importance of random genetic drift. The process of the
coalescence of genealogies is an important innovation in theoretical
population genetics, and some of the basic concepts of coalescence are
presented in Chapter 7.

In Chapter 8 we cover the rapidly expanding data on molecular evolutionary
genetics. The unifying theme in the study of molecular evolution is
Kimura's neutral theory, and a close examination is made of the
correspondence between the data and theory. This is a field in which our
empirical database and the statistical tools for quantifying and
manipulating the data are growing at a dizzying pace. Our goal is to give
the student a firm grasp of the fundamentals, and a deep enough
understanding of the principles to identify important gaps in our
knowledge. One intriguing aspect of molecular evolutionary genetics is the
discovery of new phenomena and forces taking place at the molecular level
that go beyond the realm of classical population genetics. Multigene
families and organelle genomes are described in some detail to illustrate
these uniquely molecular phenomena.

Chapter 9 covers the problem of quantitative genetics from an evolutionary
perspective. A compelling argument for using quantitative genetics for the
study of evolution is that adaptive evolution takes place at the level of
the phenotype, and quantitative genetics provides the tools for
understanding transmission of phenotypic traits. Theoretical quantitative
genetics is given special importance by the paradoxes it raises in
contrasting evolution at the levels of the phenotype and of the DNA
sequence. Our understanding of the correspondence between phenotypic and
molecular differentiation is very incomplete, and our understanding of the
correspondence between the rates of morphological and molecular evolution
is even less well developed. As in the preceding chapters, we hope that the
student is left with a feeling that there is plenty of room for imaginative
work in this area. Population genetics is a
field with a bright and expanding future.


GENETIC AND STATISTICAL BACKGROUND

The science of population genetics deals with Mendel's laws and other
genetic principles as they affect entire populations of organisms. The
organisms may be human beings, animals, plants, or microbes. The populations
may be natural, agricultural, or experimental. The environment may be city,
farm, field, or forest. The habitat may be soil, water, or air. Because of
its wide-ranging purview, population genetics cuts across many fields of
modern biology. A working knowledge has become essential in genetics,
evolutionary biology, systematics, plant breeding, animal breeding, ecology,
natural history, forestry, horticulture, conservation, and wildlife
management. A basic understanding of population genetics is also useful in
medicine, law, biotechnology, molecular biology, cell biology, sociology,
and anthropology.

Population genetics also includes the study of the various forces that
result in evolutionary changes in species through time. By defining the
framework within which evolution takes place, the principles of population
genetics are basic to a broad evolutionary perspective on biology. From an
experimental point of view, evolution provides a wealth of testable
hypotheses for all other branches of biology. Many oddities in biology
become comprehensible in the light of evolution: they result from shared
ancestry among organisms, and they attest to the unity of life on earth.

Practical applications of population genetics are extensive. Many
applications, particularly those relevant to human beings, also have
important implications in ethics and social policy. Among the applications
of population genetics in medicine, agriculture, conservation, and research
are:

• Genetic counseling of parents and other relatives of patients with
hereditary diseases.

• Genetic mapping and identification of genes for disease susceptibility in
human beings, including breast cancer, colon cancer, diabetes,
schizophrenia, and so forth.

• Implications of population screening for carriers of disease genes,
confidentiality of results, and maintenance of health insurability.

• Studies  of  the  heritability  of  IQ  score  and  its  implications  for  affirmative 
action,  welfare,  and  other  social  programs. 

• Statistical  interpretation  of  the  significance  of  matching  DNA  types  found 
between  a suspect  and  a blood  or  semen  sample  from  the  scene  of  a 
crime. 

• Design  of  studies  to  sample  and  preserve  a record  of  genetic  variation 
among  human  populations  throughout  the  world. 

• Improvement  in  the  performance  of  domesticated  animals  and  crop 
plants. 

• Organization  of  mating  programs  for  the  preservation  of  endangered 
species  in  zoos  and  wildlife  refuges. 

• Sampling  and  preservation  of  germ  plasms  of  potentially  beneficial 
plants  and  animals  that  may  soon  vanish  from  the  wild. 

• Interpretation  of  differences  in  the  nucleotide  sequences  of  genes  or 
amino  acid  sequences  of  proteins  among  members  of  the  same  or  closely 
related  species. 

The  genetic  and  statistical  principles  underlying  population  genetics  are 
for  the  most  part  simple  and  straightforward,  but  it  may  be  helpful  to  preface 
the  discussion  with  a few  key  definitions  and  concepts. 

GENE  EXPRESSION  AND  GENE  INTERACTION 

Gene  is  a general  term  meaning,  loosely,  the  physical  entity  transmitted 
from  parent  to  offspring  in  reproduction  that  influences  hereditary  traits. 
Genes  influence  human  traits  such  as  hair  color,  eye  color,  skin  color,  height, 
weight,  and  various  aspects  of  behavior — although  most  of  these  traits  are 
also  influenced  more  or  less  strongly  by  environment.  Genes  also  determine 
the  makeup  of  proteins  such  as  hemoglobin,  which  carries  oxygen  in  the  red 
blood  cells,  or  insulin,  which  is  important  in  maintaining  glucose  balance  in 
the  blood.  Genes  can  exist  in  different  forms  or  states.  For  example,  a gene 
for  hemoglobin  may  exist  in  a normal  form  or  in  any  one  of  a number  of 
alternative  forms  that  result  in  hemoglobin  molecules  that  are  more  or  less 
abnormal.  These  alternative  forms  of  a gene  are  called  alleles. 




From  a biochemical  point  of  view,  a gene  corresponds  to  a region  along  a 
molecule  of  DNA  (deoxyribonucleic  acid).  DNA  is  the  genetic  material.  A 
molecule  of  DNA  consists  of  two  strands  wound  around  each  other  in  the 
form  of  a right-handed  helix  (the  celebrated  "double  helix").  Each  strand  is  a 
polymer of constituents called nucleotides, of which there are four,
conventionally symbolized A, T, G, and C according to the nitrogen-rich base
that each contains: adenine (A), thymine (T), guanine (G), or cytosine (C).
The  paired  strands  are  held  together  by  weak  chemical  bonds  (hydrogen 
bonds)  that  form  between  A and  T at  corresponding  positions  in  opposite 
strands  or  between  G and  C at  corresponding  positions  in  opposite  strands 
(Figure  1.1).  Wherever  one  strand  contains  an  A,  the  other  across  the  way 
contains  a T;  and  wherever  one  strand  contains  a G,  the  other  across  the  way 
contains  a C.  Because  of  the  pairing  of  complementary  bases — A with  T and 
G with  C — a double-stranded  DNA  molecule  contains  an  equal  number  of  A 
and  T nucleotides  as  well  as  an  equal  number  of  G and  C nucleotides.  DNA 
molecules  can  be  very  long.  The  DNA  molecule  in  the  bacterium  E.  coli  is 
about  4.7  million  base  pairs,  that  in  the  largest  chromosome  in  the  fruit  fly 
Drosophila  melanogaster  is  about  65  million  base  pairs,  and  that  in  the  largest 
human  chromosome  is  about  230  million  base  pairs.  Physical  manipulation 
of  such  large  molecules  is  impractical.  In  order  to  be  studied,  they  must  first 
be  broken  into  smaller  pieces. 
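
The pairing rule just described can be checked with a short Python sketch. The sequence and the helper name `partner_strand` are hypothetical, chosen only for illustration; the point is that pairing A with T and G with C forces equal numbers of A and T, and of G and C, in the duplex.

```python
from collections import Counter

# Complementary base pairing in DNA: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand: str) -> str:
    """Return the base found opposite each position of `strand`.

    The result is listed position by position, so it reads in the
    opposite polarity to the input strand.
    """
    return "".join(PAIR[base] for base in strand)


top = "ATGCGTTAAC"             # hypothetical 10-nucleotide strand
bottom = partner_strand(top)   # the strand across the way

# Count bases over both strands of the duplex.
counts = Counter(top + bottom)

print(bottom)
print("A == T:", counts["A"] == counts["T"])
print("G == C:", counts["G"] == counts["C"])
```

The equalities hold for the duplex as a whole; a single strand taken alone need not contain equal numbers of A and T.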

Gene  Expression 

Most genes code for the polypeptide chains that constitute proteins. The
code is the sequence of nucleotides along the DNA. In the decoding of the
nucleotide sequence in DNA and also in the synthesis of proteins, several
types of RNA (ribonucleic acid) are essential. RNA is also a polymer of
nucleotides, each of which carries a base. Three of the bases in RNA (A, C,
and G) are the same as those in DNA. The fourth, uracil (U), is different.
When an RNA strand pairs with a complementary strand of DNA, U in the RNA
pairs with A in the DNA. Hence, the base-pairing role of U in RNA is the
same as that of T in DNA.

Figure 1.1 Genes are fundamental units of genetic information that
correspond chemically to the sequence of nucleotides in a segment of DNA. A
molecule of duplex DNA is composed of two intertwined strands, each of which
consists of a long sequence of nucleotides. The strands are held together by
pairing between the bases A and T in opposite strands and between the bases
G and C in opposite strands. The short diagonal lines indicate the paired
bases. There are 10 base pairs per turn of the double helix. A typical gene
consists of hundreds or thousands of nucleotides, only a few of which are
shown here.

The  essentials  of  gene  expression  in  the  cells  of  higher  organisms 
(eukaryotes)  are  outlined  in  Figure  1.2.  The  coding  regions  of  the  DNA  in  a 
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Figure  1 .2  Processes  in  gene  expression  in  eukaryotic  cells.  (A)  DNA  regions 
coding  for  the  amino  acids  in  a single  polypeptide  can  be  interrupted  by  non- 
coding regions  (introns).  (B)  When  the  DNA  is  copied  into  RNA  in  transcription 
both  coding  and  noncoding  regions  are  transcribed . However,  the  introns  are 
removed  from  the  transcript  by  processing.  (C)  In  the  messenger  RNA,  the  cod- 
ing regions  are  contiguous.  The  messenger  RNA  is  translated  to  form  the  chain 
ot  linked  ammo  acids  constituting  the  polypeptide. 
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gene,  which  code  for  amino  acids,  are  often  interrupted  by  one  or  more  non- 
coding regions  known  as  intervening  sequences  or  introns.  In  the  first  step  in 
gene  expression  (transcription),  a molecule  of  RNA  is  produced  that  is  com- 
plementary in  base  sequence  to  one  of  the  strands  of  DNA  (Figure  1.2A). 
Every gene includes a regulatory region (sometimes more than one) that
determines when transcription takes place, the types of cells in which it takes
place, and the strand that is to be transcribed. Because of the base-pairing
rules, a DNA sequence (say, 3'-ATCG-5') results in a complementary RNA
sequence (in this example, 5'-UAGC-3'). Note that the DNA and RNA
strands each have a polarity or directionality. The terms 5' and 3' refer to the
polarity of the strands. The 5' end typically terminates with a free phosphate
group and the 3' end typically terminates with a free hydroxyl group (-OH).
When two strands of nucleic acid are paired, the polarity of each strand is
opposite to that of the other. In the duplex DNA in Figure 1.2, for example,
the left-to-right polarity of one strand is 5'-to-3', whereas the left-to-right
polarity of the partner strand is 3'-to-5'. Similarly, in transcription, the
template DNA strand has a left-to-right polarity of 3'-to-5', whereas the RNA
transcript has the left-to-right polarity of 5'-to-3'. Because of the
complementary base pairing between DNA and RNA nucleotides, the base-sequence
code in DNA becomes converted into a base-sequence code in RNA. In
transcription, the base sequence present in the introns is also faithfully
copied into the base sequence of the RNA transcript.
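
The base-pairing rule for transcription can be sketched in a few lines of
Python. This is only an illustration: the function name `transcribe` and the
convention of passing the template strand written 3'-to-5' are assumptions made
here, not notation from the text.

```python
# Pairing from a DNA template base to the RNA base it specifies
# (A in DNA pairs with U in RNA; T with A; C with G; G with C).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_3_to_5: str) -> str:
    """Return the RNA transcript, written 5'-to-3', for a DNA template
    strand written 3'-to-5'. Bases are paired one by one, left to right."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

# The example from the text: template 3'-ATCG-5' gives RNA 5'-UAGC-3'.
print(transcribe("ATCG"))
```

Because the template is read 3'-to-5' while the transcript is written 5'-to-3',
no reversal of the string is needed in this convention.
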

The  second  step  in  gene  expression  in  eukaryotes  is  RNA  processing 
(Figure  1.2B).  The  beginning  and  end  of  the  RNA  transcript  are  chemically 
modified  and  the  introns  are  removed  by  splicing  (cutting  and  rejoining).  RNA 
processing  results  in  a molecule  called  messenger  RNA  (mRNA),  in  which  the 
coding  regions  have  been  made  contiguous.  The  regions  in  the  original  RNA 
transcript  that  are  retained  in  the  mature  mRNA  are  called  exons.  The  central 
part  of  the  mRNA  contains  the  spliced  exons  that  code  for  the  amino  acid 
sequence  of  a polypeptide  chain.  The  mRNA  also  includes  exons  upstream  and 
downstream  from  the  protein-coding  region.  The  upstream  region  is  the  5' 
untranslated  region  and  the  downstream  region  is  the  3'  untranslated  region. 

The final step in gene expression is translation, in which the mRNA
molecule combines with ribosomes and other types of RNA molecules in the
cytoplasm to produce the final polypeptide (Figure 1.2C). In the coding
region of the mRNA, each adjacent group of three nucleotides constitutes a
separate coding group or codon that specifies which amino acid is to be
incorporated into the polypeptide chain. The ribosome moves along the
mRNA in steps of three nucleotides (codon by codon). As each new codon
comes into place, the correct amino acid is brought into line and attached to
the end of the growing chain of amino acids. New amino acids are added to
the growing chain until a codon specifying "stop" is encountered. At this
point synthesis of the chain of amino acids is finished and the polypeptide is
released from the ribosome.
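
The codon-by-codon reading described above can be sketched as follows. The
sketch assumes the standard genetic code, written with single-letter amino acid
abbreviations and `*` for stop; the names `CODON_TABLE` and `translate` are
illustrative choices, and the input is assumed to be a coding region that
already begins at the start codon.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(coding_region: str) -> str:
    """Read an mRNA coding region three nucleotides at a time and return the
    polypeptide as single-letter amino acids, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(coding_region) - 2, 3):
        amino_acid = CODON_TABLE[coding_region[i:i + 3]]
        if amino_acid == "*":      # stop codon: the chain is released
            break
        peptide.append(amino_acid)
    return "".join(peptide)

# AUG CAU CAA GUU UAA -> Met, His, Gln, Val, then stop
print(translate("AUGCAUCAAGUUUAA"))
```
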

In prokaryotes, which include bacteria and other organisms lacking a
nucleus, gene expression is essentially identical to that in eukaryotes except
for the absence of RNA processing. Genes in prokaryotes do not contain
introns and so splicing is unnecessary. In prokaryotes, the original RNA
transcript is used immediately as mRNA and translated into a polypeptide.
Because there is no separate nucleus, translation in prokaryotes often begins
immediately when the 5' end of an RNA transcript comes off the DNA and
even before transcription of the 3' end of the same molecule has been
completed.

The central role of RNA in gene expression is one of the oddities of
biology that makes sense in the light of evolution. That gene expression is
configured around RNA is a legacy of the earliest forms of life when RNA
molecules served both as carriers of genetic information and as catalytic
molecules. The role of RNA as carrier of genetic information was gradually
replaced by DNA, and the role of RNA as catalytic molecules was gradually
replaced by proteins. At every step along the way, as the RNA world evolved
into the DNA world, the role of RNA was indispensable in the processes of
information transfer and protein synthesis, and so the RNA intermediates
became locked in place.

The  Genetic  Code 

The genetic code is the list of all codons showing which amino acid each
codon specifies. Table 1.1 shows the standard genetic code used in nuclear
genes in most organisms. A few organisms and some cellular organelles,
such as mitochondria, use slightly altered codes. The codons in Table 1.1 are
those found in the mRNA. The amino acids are given by three-letter
abbreviations as well as by conventional single-letter abbreviations. Codon AUG
is the start codon in polypeptide synthesis; it specifies methionine (Met) at
the beginning of the polypeptide as well as at internal positions. Three
codons are stops that result in termination of polypeptide synthesis: UAA,
UAG, and UGA. The genetic code is redundant in that most amino acids are
specified by more than one codon. Most of the redundancy is in the third
codon position.

A code for an amino acid is twofold degenerate if either of two sequences
specifies the same amino acid. Twofold degenerate codes have the pattern ••Y
or ••R, where •• stands for the bases in codon positions 1 and 2. The symbol Y
stands for any pyrimidine base (either U or C); the symbol R stands for any
purine base (either A or G). For example, CAU and CAC both code for
histidine (His), fitting the pattern CAY; and CAA and CAG both code for
glutamine (Gln), fitting the pattern CAR. A code for an amino acid is fourfold
degenerate if any of four sequences specifies the same amino acid; fourfold
degenerate codes have the form ••N, where N means any nucleotide (U, C, A,
or G). For example, GUU, GUC, GUA, and GUG all code for valine (Val),
which fits the pattern GUN. Note in Table 1.1 that the code for isoleucine is
threefold degenerate and those for leucine, arginine, and serine are each sixfold
degenerate.

TABLE 1.1 THE STANDARD GENETIC CODE

                    Second nucleotide in codon

     U                C                A                G

UUU  Phe (F)     UCU  Ser (S)     UAU  Tyr (Y)     UGU  Cys (C)
UUC  Phe (F)     UCC  Ser (S)     UAC  Tyr (Y)     UGC  Cys (C)
UUA  Leu (L)     UCA  Ser (S)     UAA  Stop        UGA  Stop
UUG  Leu (L)     UCG  Ser (S)     UAG  Stop        UGG  Trp (W)

CUU  Leu (L)     CCU  Pro (P)     CAU  His (H)     CGU  Arg (R)
CUC  Leu (L)     CCC  Pro (P)     CAC  His (H)     CGC  Arg (R)
CUA  Leu (L)     CCA  Pro (P)     CAA  Gln (Q)     CGA  Arg (R)
CUG  Leu (L)     CCG  Pro (P)     CAG  Gln (Q)     CGG  Arg (R)

AUU  Ile (I)     ACU  Thr (T)     AAU  Asn (N)     AGU  Ser (S)
AUC  Ile (I)     ACC  Thr (T)     AAC  Asn (N)     AGC  Ser (S)
AUA  Ile (I)     ACA  Thr (T)     AAA  Lys (K)     AGA  Arg (R)
AUG  Met (M)     ACG  Thr (T)     AAG  Lys (K)     AGG  Arg (R)

GUU  Val (V)     GCU  Ala (A)     GAU  Asp (D)     GGU  Gly (G)
GUC  Val (V)     GCC  Ala (A)     GAC  Asp (D)     GGC  Gly (G)
GUA  Val (V)     GCA  Ala (A)     GAA  Glu (E)     GGA  Gly (G)
GUG  Val (V)     GCG  Ala (A)     GAG  Glu (E)     GGG  Gly (G)

Note: Codons are nonoverlapping three-base sequences present in mRNA, each of
which specifies an amino acid in a polypeptide chain or terminates synthesis
("Stop"). The full names of the amino acids are phenylalanine (Phe), leucine
(Leu), isoleucine (Ile), methionine (Met), valine (Val), serine (Ser), proline
(Pro), threonine (Thr), alanine (Ala), tyrosine (Tyr), histidine (His),
glutamine (Gln), asparagine (Asn), lysine (Lys), aspartic acid (Asp), glutamic
acid (Glu), cysteine (Cys), tryptophan (Trp), arginine (Arg), and glycine (Gly).
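
The degeneracy classes mentioned above can be tallied directly from the
standard code. The following is a minimal sketch; the names `CODON_TABLE` and
`degeneracy` are illustrative, and amino acids are keyed by their single-letter
abbreviations with `*` standing for a stop codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons specifying each amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

for amino_acid, n_codons in sorted(degeneracy.items(), key=lambda kv: kv[1]):
    print(amino_acid, n_codons)
```

In this tally isoleucine (I) has three codons, leucine (L), arginine (R), and
serine (S) have six each, and methionine (M) and tryptophan (W) have one each.
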

The codons for amino acids are not used randomly in proteins. There are
preferred codons for amino acids that differ from one gene to the next and
from one organism to another. Codon preferences exist even within
redundancy classes. In Drosophila, for example, among codons for histidine, CAC
is used more than CAU in a ratio of about 2:1. Similarly, among codons for
glutamine, CAG is used more than CAA in a ratio of about 3:1. Another
example of nonrandom codon usage is the AUA codon for isoleucine, which
tends to be avoided in most proteins in most organisms. In Drosophila, AUU
and AUC are used more than AUA in a ratio of about 10:1. One evolutionary
hypothesis that explains the avoidance of AUA is that, because of the
degeneracy of the genetic code, the AUA codon might sometimes be translated as
AUG, which codes for methionine. Because methionine is likely to change
protein structure radically, the mistranslation would be a costly mistake.
Through evolutionary time, one by one, the AUA codons in a messenger
RNA become replaced with AUU or AUC, minimizing this type of
misincorporation error. This misincorporation hypothesis for AUA codon
avoidance has not been tested, but it is testable.

Alleles 

Alternative alleles of a gene differ in their sequence of nucleotides (Figure
1.3). For example, where one allele has a T-A base pair in the DNA, another
may have a C-G base pair at the same position. Because of redundancy in the
code, not all nucleotide substitutions result in a replacement of one amino
acid for another. In Figure 1.3B, for example, if a mutation at the third
position in the second codon (asterisk) changes one pyrimidine into the other,
the new codon still codes for histidine. On the other hand, some nucleotide
substitutions at the third position do result in amino acid replacements. For
example, in Figure 1.3C, if the third position in the second codon changes
from a pyrimidine to a purine, the codon changes from one for histidine to
one for glutamine. Most nucleotide substitutions at codon positions one and
two result in amino acid replacements (Figure 1.3D).
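
Whether a given nucleotide substitution changes the amino acid can be checked
against the standard code. The sketch below works with mRNA codons; the function
name `classify_substitution` and its 1-based `position` argument are assumptions
made for illustration, and it is meant for changes between amino acid codons
only (changes to or from a stop codon are not treated specially).

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(codon: str, position: int, new_base: str):
    """Substitute new_base at codon position 1, 2, or 3 and report the mutant
    codon, the amino acid before and after, and whether the amino acid changed."""
    mutant = codon[:position - 1] + new_base + codon[position:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    kind = "synonymous" if before == after else "replacement"
    return mutant, before, after, kind

# Third position, pyrimidine to pyrimidine: histidine is retained.
print(classify_substitution("CAU", 3, "C"))
# Third position, pyrimidine to purine: histidine becomes glutamine.
print(classify_substitution("CAU", 3, "A"))
```
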

Not all alleles differ by a mere nucleotide substitution. Relative to the
typical or wildtype allele, some alleles may have a deletion of a number of
nucleotide pairs or an insertion into the DNA molecule. The number of
nucleotides deleted or inserted may be small (as few as one nucleotide pair)
or large. Some insertions are thousands of nucleotide pairs in size. Many
large insertions result from the activity of transposable elements, which are
specialized sequences of DNA able to replicate and insert at novel positions
virtually anywhere in the DNA of the organism in which they are present.
Alleles also may differ in the number of copies of short sequences present in
tandem arrays in the DNA. For example, near many genes in human beings
are tandem copies of dinucleotides, such as 5'-CACACACA . . . -3'. Such a
repeating sequence is symbolized as (5'-CA-3')n. The number of copies (n) of
the dinucleotide repeat often ranges from fewer than ten to hundreds, and the
number of copies may differ dramatically from one allele to the next. Some
alleles even differ from wildtype in having an inversion of the nucleotide
sequence in a region of DNA.

Genotype  and  Phenotype 

Within a living cell, genes are arranged in linear order along microscopic
threadlike bodies called chromosomes. A typical chromosome may contain
several thousand genes. The position of a gene along a chromosome is called
the locus of the gene. In most higher organisms, each cell contains two copies
of each type of chromosome. Such organisms, in which the chromosomes are
present in pairs, are said to be diploid. In each pair of chromosomes, one
member is inherited from the mother through the egg and the other is
inherited from the father through the sperm. At every locus, therefore, diploid
organisms contain two alleles, one each at corresponding positions in the
maternal and paternal chromosomes. If the two alleles at a locus are
chemically identical (in the sense of having the same nucleotide sequence
along the DNA), the organism is said to be homozygous at the locus under
consideration; if the two alleles at a locus are chemically different, the
organism is said to be heterozygous at the locus. The term gene is a general term
usually used in the sense of locus.

Figure 1.3 Alleles are alternative forms of a gene. (A) The arrows show how
the genetic information in a portion of the nucleotide sequence of DNA specifies
the amino acid sequence in a portion of a polypeptide. Each group of three
adjacent nucleotides corresponds to one amino acid in the polypeptide. (B, C, D)
Substitution of one nucleotide for another in the DNA (indicated by the
asterisks and heavy lines) can result in the replacement of one amino acid for
another in the polypeptide.

Geneticists make a fundamental distinction between the genetic
constitution of an organism and the physical or biochemical attributes of the
organism. The genetic constitution of an organism is called the genotype; genotype
thus refers to the particular alleles present in an organism at all loci that affect
the trait in question. For example, if a trait is influenced by two genes, each
with two alleles, then there are nine possible genotypes, as follows:


AA ; BB     AA ; Bb     AA ; bb

Aa ; BB     Aa ; Bb     Aa ; bb

aa ; BB     aa ; Bb     aa ; bb

where A and a refer to the alleles of the first gene and B and b refer to the
alleles of the second. When the genes are linked (located in
the same chromosome), it is sometimes necessary to distinguish between the
genotypes AB/ab and Ab/aB, in which case there are ten possible genotypes.
In  contrast  to  genotype,  the  physical  expression  of  a genotype  is  called  the 
phenotype.  Examples  of  phenotypes  include  hair  color,  eye  color,  height, 
weight, number of kernels on an ear of corn, number of eggs laid by a hen,
and  round  versus  wrinkled  pea  seeds.  The  distinction  between  the  genetic 
constitution  of  an  organism  (genotype)  and  the  physical  or  biochemical 
attributes  of  the  organism  (phenotype)  is  particularly  important  in  cases  in 
which  the  environment  can  affect  the  trait;  in  such  cases,  two  organisms  with 
the  same  genotype  can  nevertheless  have  different  phenotypes  because  of 
differences  in  the  environment.  Conversely,  two  organisms  with  the  same 
phenotype  can  have  different  genotypes. 


PROBLEM 1.1 If a gene in a diploid organism has m alternative alleles,
show that the number of possible genotypes equals m(m + 1)/2.

ANSWER: Consider first the heterozygotes. There are m ways of
choosing the first allele and, having done that, there are m - 1 ways of
choosing a different second allele. Altogether, there are m(m - 1)/2
different heterozygotes. The division by 2 is necessary because, for
each heterozygote (say, $A_iA_j$), it makes no difference whether $A_i$ was
chosen first and $A_j$ second or the other way around. In addition to
the heterozygotes, there are m possible homozygotes. Hence, the total
number of diploid genotypes equals [m(m - 1)/2] + m = m(m + 1)/2.
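
The count in Problem 1.1 can be checked by brute-force enumeration. In this
sketch the allele labels `A1`, `A2`, ... and the function name
`diploid_genotypes` are hypothetical; a genotype is treated as an unordered
pair of alleles, so `(A1, A2)` and `(A2, A1)` are counted once.

```python
from itertools import combinations_with_replacement

def diploid_genotypes(m: int):
    """List every unordered pair of alleles (homozygotes included)
    for a gene with m alternative alleles."""
    alleles = [f"A{i}" for i in range(1, m + 1)]
    return list(combinations_with_replacement(alleles, 2))

for m in range(1, 7):
    genotypes = diploid_genotypes(m)
    heterozygotes = [g for g in genotypes if g[0] != g[1]]
    # m(m - 1)/2 heterozygotes plus m homozygotes gives m(m + 1)/2 in total
    assert len(heterozygotes) == m * (m - 1) // 2
    assert len(genotypes) == m * (m + 1) // 2
    print(m, len(genotypes))
```
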


Dominance  and  Gene  Interaction 

Whether  each  genotype  has  a single,  unique  expression  of  the  trait  depends 
on  the  manner  in  which  the  alleles  of  a gene  interact  in  development.  For  the 
alleles  of  one  gene,  dominance  refers  to  the  concealment  of  the  presence  of 
one  allele  by  the  strong  phenotypic  effects  of  another.  For  example,  with  two 
alleles  there  are  three  possible  genotypes: 

AA  Aa  aa 

Several types of dominance are distinguished, as illustrated by the
following examples:

• Complete  dominance:  A is  completely  dominant  to  a if  the  phenotypes  of 
AA  and  Aa  cannot  be  distinguished. 

• Incomplete dominance: A shows incomplete dominance with respect to a if
the phenotype of Aa is intermediate between that of AA and that of aa.
This situation is also referred to as partial dominance or intermediate
dominance. When the phenotype can be measured on a quantitative
scale (for example, the number of kernels on an ear of corn) and the
phenotype of Aa is exactly the average between that of AA and that of aa,
then the alleles are said to be additive alleles and the type of dominance
is sometimes called semidominance.

• Codominance:  A and  a are  codominant  if  the  products  of  both  alleles  can 
be  detected  in  Aa  heterozygotes.  Many  alleles  are  codominant  at  the 
level  of  their  protein  products  because  two  different  forms  of  the 
polypeptide,  encoded  by  A and  a,  can  be  detected  in  heterozygotes.  At 
the  level  of  the  DNA  sequences,  all  alleles  differing  in  DNA  sequence  are 
codominant. 

It is important to note that dominance is not a characteristic of alleles so
much as a characteristic of the manner in which the phenotype is examined.
An allele may show complete dominance if the phenotype is examined in
one way, no dominance if examined in another, and codominance if
examined in still another. For example, the allele for round pea seeds W studied
by Gregor Mendel is completely dominant to that for wrinkled seeds w when
the phenotype "round" versus "wrinkled" is examined. The genetic defect in
wrinkled seeds is the absence of an enzyme needed for the synthesis of a
branched-chain form of starch. Microscopic examination reveals subtle
differences in the form of the starch grains in seeds of the three genotypes: WW
seeds contain large, well-rounded starch grains, retain water and shrink
uniformly as they ripen, so the seeds do not become wrinkled; ww seeds lack the
branched-chain starch and are irregular in shape because the ripening seeds
lose water more rapidly and shrink unevenly. However, heterozygous Ww
seeds have starch grains that are intermediate in shape even though the seeds
shrink uniformly and show no wrinkling. Therefore, at the level of the starch
grains, there is incomplete dominance of W and w because the starch grains
in the heterozygotes are intermediate between the two homozygotes.
Furthermore, the difference in DNA sequence between W and w can readily be
detected with modern methods, so that W and w are codominant at the level
of DNA sequence.

For traits affected by more than one gene, the relation between
genotype and phenotype depends not only on the degree of dominance of the
alleles of each gene but also on the type of interaction between the genes in
development. For example, suppose that the trait in question is degree of
pigmentation and that pigmentation is determined by two alleles of each
of two genes, say, A, a and B, b. Suppose further that the total amount of
pigment in an organism results from the total number of A and B alleles
present, each of which adds a single unit of pigmentation to the
phenotype. Then, as shown in Table 1.2, there are only five possible levels of
pigmentation (0 through 4) and genotypes aa BB, Aa Bb, and AA bb all have the
same phenotype. Because each uppercase allele adds the same quantity to
the total phenotype, the type of gene interaction in Table 1.2 is said to be
additive.
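
The additive model just described amounts to counting uppercase alleles. A
minimal sketch, assuming genotypes are written as strings such as "Aa Bb" (the
function name `pigment_units` and the dictionary `levels` are illustrative):

```python
from collections import defaultdict
from itertools import product

def pigment_units(genotype: str) -> int:
    """Count the uppercase (A or B) alleles, each adding one unit of pigment."""
    return sum(1 for allele in genotype if allele.isupper())

# Group the nine two-gene genotypes by their amount of pigmentation.
levels = defaultdict(list)
for first, second in product(["AA", "Aa", "aa"], ["BB", "Bb", "bb"]):
    genotype = f"{first} {second}"
    levels[pigment_units(genotype)].append(genotype)

for units in sorted(levels, reverse=True):
    print(units, ", ".join(levels[units]))
```

The grouping has five levels (0 through 4), with aa BB, Aa Bb, and AA bb
sharing the level of two units.
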

Segregation  and  Recombination 

The essential mechanism of inheritance was established by Gregor Mendel
(1822-1884) in experiments with garden peas carried out in the years 1856 to
1863 in a small garden plot next to the monastery in which he lived. Mendel
showed that the alleles of each gene segregate from one another in the
formation of reproductive cells or gametes. Because of segregation,
heterozygous genotypes form equal numbers of gametes containing each allele.

TABLE 1.2 A MODEL OF ADDITIVE GENE ACTION [a]

Genotype                    Amount of pigmentation [b]

AA BB                       4
Aa BB, AA Bb                3
aa BB, Aa Bb, AA bb         2
Aa bb, aa Bb                1
aa bb                       0

[a] At left are shown the nine possible genotypes of two genes with two alleles
of each gene. At right is shown the amount of pigmentation expected in each
genotype when it is assumed that each allele designated by an uppercase letter
is responsible for producing a certain amount of pigment.

[b] Measured as an increase in pigmentation over that in aa bb genotypes.


Furthermore,  because  gametes  unite  at  random  in  fertilization,  the  following 
are  the  results  of  simple  Mendelian  segregation: 

• AA x AA matings produce all AA progeny.

• AA x Aa matings produce 1/2 AA and 1/2 Aa progeny.

• AA x aa matings produce all Aa progeny.

• Aa x Aa matings produce 1/4 AA, 1/2 Aa, and 1/4 aa progeny.

• Aa x aa matings produce 1/2 Aa and 1/2 aa progeny.

• aa x aa matings produce all aa progeny.
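
The six mating types listed above all follow from one rule: each parent
contributes either of its two alleles with probability 1/2, and gametes unite
at random. A short sketch (the function name `cross` is an illustrative choice)
reproduces the proportions:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def cross(parent1: str, parent2: str) -> dict:
    """Return progeny genotype proportions for a one-gene cross such as
    cross("Aa", "Aa"). Each of the four gamete pairings has probability 1/4."""
    progeny = Counter()
    for gamete1, gamete2 in product(parent1, parent2):
        # sorted() puts the uppercase allele first, so "aA" is recorded as "Aa"
        genotype = "".join(sorted(gamete1 + gamete2))
        progeny[genotype] += Fraction(1, 4)
    return dict(progeny)

for mating in [("AA", "AA"), ("AA", "Aa"), ("AA", "aa"),
               ("Aa", "Aa"), ("Aa", "aa"), ("aa", "aa")]:
    print(" x ".join(mating), cross(*mating))
```
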

The physical basis of Mendelian segregation is that the maternal and
paternal pairs of chromosomes are separated into different cells in the
formation of gametes. Prior to their separation, the maternal and paternal
chromosomes associate intimately all along their length and alleles may be
interchanged in the process of recombination (Figure 1.4). The interchange of
parts takes place after the chromosomes have replicated, and only two of the
four chromosome strands participate in any one exchange. Recombination
results in the creation of allele combinations different from either parental
chromosome. In Figure 1.4, the A b and a B combinations are recombinant,
whereas the A B and a b combinations are parental (nonrecombinant).
Therefore, a single exchange between parental chromosomes results in two
recombinant and two nonrecombinant gametes.

In organisms with an XX-XY chromosomal mechanism of sex
determination, Mendelian segregation randomizes the sex ratio at fertilization. In
mammals and many other animals, sex is determined by sex chromosomes: males
have an X and a Y chromosome, and females have two X chromosomes. In
males, the X and Y chromosomes segregate, yielding equal proportions of
X-bearing and Y-bearing sperm. If both types of sperm are equally able to
fertilize eggs, then random union of sperm with eggs yields 1/2 XX (female) and
1/2 XY (male) chromosome constitutions.

Figure 1.4 Recombination results from a physical interchange of parts
between chromosomes. New combinations of alleles are created that differ from
either parental chromosome. The physical interchange of parts takes place in
gamete formation after the chromosomes have replicated, and only two of the
four chromosome strands participate in any one exchange.


PROBABILITY  IN  POPULATION  GENETICS 

The basic concepts of probability needed for elementary population genetics
are quite straightforward. They will be introduced with the concrete example
of genetic segregation in Figure 1.5, which deals with the progeny of the mating
Aa x Aa. Considerations in probability always begin with an experiment of
some kind. The experiment may be either a real experiment or a conceptual
experiment. In Figure 1.5, it is a conceptual experiment in which Aa is crossed
with Aa. In probability calculations, it is also necessary to define all possible
outcomes of the experiment. The outcomes are called elementary outcomes
because they are defined in such a way that, in any repetition of the
experiment, one and only one of the elementary outcomes must be realized. For
example, if we are interested in the genotypes among the progeny of the
mating Aa x Aa, the possible elementary outcomes for each offspring are AA, Aa,
or aa. (Note that, in defining these as the elementary outcomes, we are
ignoring the possibility of either A or a mutating to a novel allele.) To proceed
further, we must assign to each elementary outcome a probability, a number
between 0 and 1 that measures how much confidence we have that the
outcome will be realized. The probabilities assigned to the outcomes are based on
genetic reasoning, intuition, or experience. One requirement of the assigned
probabilities is that the probabilities of all the elementary outcomes must add
to 1; this is the mathematical consequence of requiring that one of the
elementary outcomes must be realized. For example, if there are three elementary
outcomes, and all are equally probable, then each has a probability of 1/3. In
Figure 1.5, the probabilities assigned to the elementary outcomes AA, Aa, and aa
are 1/4, 1/2, and 1/4, respectively, because these are the relative proportions
of the three progeny genotypes expected from Mendelian segregation.

(A) Addition rule

    Mating:     Aa x Aa
    Offspring:  1/4 AA + 1/2 Aa + 1/4 aa
    A- means "Offspring either AA or Aa"
    Pr(A-) = 1/4 (AA) + 1/2 (Aa) = 3/4

(B) Multiplication rule

                  Birth order
    Sibship     1      2      3      Probability
       1        A-     A-     A-     3/4 x 3/4 x 3/4 = 27/64
       2        A-     A-     aa     3/4 x 3/4 x 1/4 =  9/64
       3        A-     aa     A-     3/4 x 1/4 x 3/4 =  9/64
       4        aa     A-     A-     1/4 x 3/4 x 3/4 =  9/64
       5        A-     aa     aa     3/4 x 1/4 x 1/4 =  3/64
       6        aa     A-     aa     1/4 x 3/4 x 1/4 =  3/64
       7        aa     aa     A-     1/4 x 1/4 x 3/4 =  3/64
       8        aa     aa     aa     1/4 x 1/4 x 1/4 =  1/64

Figure 1.5 Basic concepts of probability illustrated by Mendelian segregation
in the mating Aa x Aa. The elementary outcomes of the mating are the possible
genotypes of each progeny (AA, Aa, and aa), and these are realized with
probabilities 1/4, 1/2, and 1/4, respectively. (A) The compound event A- consists
of the two elementary outcomes AA and Aa, and the probability of A- is the sum
of the probabilities of these elementary outcomes (addition rule). (B) The
possible distributions of genotypes A- and aa in sibships of size three
offspring. Successive births are independent, and so the probability of any
sibship equals the product of the probabilities for each birth separately
(multiplication rule).

The  Addition  Rule 

An outcome of a conceptual experiment is an event. The distinction between
an event and an elementary outcome is that an event can include more than
one elementary outcome. For example, in Figure 1.5A, the event "the
offspring has at least one copy of the dominant A allele" consists of two
elementary outcomes, namely, genotypes AA and Aa. This event may be
symbolized A-, where the dash indicates that the unspecified allele may be either
A or a. For events defined in terms of elementary outcomes, the probability
of an event equals the sum of the probabilities of the elementary outcomes
included in the event. In the present example,

Pr(A-) = Pr(AA) + Pr(Aa) = 1/4 + 1/2 = 3/4

More generally, two events are mutually exclusive if they cannot be
realized simultaneously. The addition rule states that, for mutually exclusive
events, the probability that either one or the other is realized equals the sum
of the probabilities of the separate events.

The  Multiplication  Rule 

Figure 1.5B shows all possible genotypes of sibships of three offspring from
the mating Aa x Aa, with each offspring classified as A- versus aa. (A sibship
is a group of brothers and sisters.) The probability of A- in any particular
birth is 3/4 and that of aa is 1/4. The probabilities at the right are the overall
probabilities for each of the sibships. They are obtained by multiplication of
the probability for each birth because successive births are independent,
which means that the genotype of any birth has no effect on the genotype of
any other birth. Because of the independence, among the 3/4 of the sibships
with A- in the first birth, 3/4 will have A- in the second birth, and among
the 3/4 x 3/4 of the sibships with A- in the first two births, 3/4 will have
A- in the third birth. Therefore, the overall probability of three A- births
is 3/4 x 3/4 x 3/4. The reasoning for the other types of sibships is similar. More
generally, the multiplication rule states that, whenever two events are
independent, the probability of their joint realization is the product of the
probabilities of their being realized separately.
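
Both rules can be checked numerically for the mating Aa x Aa. The sketch below
uses exact fractions; the names `GENOTYPE_PR`, `BIRTH_PR`, and
`sibship_probability` are illustrative, and births are assumed independent as
in the text.

```python
from fractions import Fraction
from itertools import product

# Elementary outcomes for one offspring of Aa x Aa.
GENOTYPE_PR = {"AA": Fraction(1, 4), "Aa": Fraction(1, 2), "aa": Fraction(1, 4)}

# Addition rule: A- is the union of the mutually exclusive outcomes AA and Aa.
BIRTH_PR = {"A-": GENOTYPE_PR["AA"] + GENOTYPE_PR["Aa"], "aa": GENOTYPE_PR["aa"]}

def sibship_probability(births) -> Fraction:
    """Multiplication rule: multiply the probabilities of independent births."""
    probability = Fraction(1)
    for phenotype in births:
        probability *= BIRTH_PR[phenotype]
    return probability

# All eight birth orders for a sibship of three offspring.
sibships = {order: sibship_probability(order)
            for order in product(BIRTH_PR, repeat=3)}
for order, probability in sibships.items():
    print(" ".join(order), probability)
```

The eight probabilities sum to 1, as they must, because exactly one birth order
is realized in any sibship of three.
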

Repeated  Trials 

The sibships in Figure 1.5B are an example of repeated trials of a
conceptual experiment. Repeated trials are encountered frequently in probability.
They govern tosses of a coin or dice, deals of cards, successive spins of a
roulette wheel, and so forth. Repeated trials are also important in population
genetics because successive offspring of a mating are independent events
and thus repeated trials. Furthermore, it is apparent from Figure 1.5B that the
different birth orders are mutually exclusive: any sibship can have one and
only one birth order of A- or aa. Because the birth orders are mutually
exclusive, their probabilities may be combined by the addition rule. Hence, the
composite events below have the following probabilities:

Pr(two A- and one aa) = 9/64 + 9/64 + 9/64 = 27/64

Pr(one A- and two aa) = 3/64 + 3/64 + 3/64 = 9/64


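Note in Figure 1.5B that, when the sibships with the same number of A-
and aa genotypes are combined, the overall probabilities are given by
successive terms in the expansion of:

$$
\begin{aligned}
\left(\tfrac{3}{4}\,A\text{-} + \tfrac{1}{4}\,aa\right)^{3}
 ={}& 1 \times (3/4)^{3}          && (A\text{-}\ A\text{-}\ A\text{-}) \\
 &+ 3 \times (3/4)^{2}(1/4)^{1}   && (A\text{-}\ A\text{-}\ aa) \\
 &+ 3 \times (3/4)^{1}(1/4)^{2}   && (A\text{-}\ aa\ aa) \\
 &+ 1 \times (1/4)^{3}            && (aa\ aa\ aa)
\end{aligned}
$$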

The coefficients 1 : 3 : 3 : 1 are the number of combinations in which each
triad of genotypes can be born: 1 for A- A- A-, 3 for A- A- aa (because the aa
genotype can be born either first, second, or third), and so forth. Each power
of 3/4 and 1/4 is the probability that any one of the birth orders will be realized;
for example, $(3/4)^2(1/4)^1$ is the probability that any one particular birth
order with two A- and one aa genotype will be realized.

In all cases of repeated and independent trials, the overall probabilities
are given by analogous expansions. Suppose that any one trial may result in
either of two mutually exclusive events, A or B, and that the probability of
event A is p and that of event B is q (with p + q = 1). Among a total of n
independent trials, what is the probability that A is realized exactly r times
and B is realized exactly n - r times? By the multiplication rule, any particular
combination of r As and n - r Bs has a probability $p^r q^{n-r}$. Deducing the
total number of combinations of r As and n - r Bs is a little less obvious, but
it is given by the coefficient of the term $p^r q^{n-r}$ in the expansion of
$(p + q)^n$, which equals

$$\frac{n!}{r!\,(n-r)!} \tag{1.1}$$

where the exclamation point means the factorial, the product of all integers
from 1 through the number in question. For example,
$n! = 1 \times 2 \times 3 \times \cdots \times n$.
For consistency, the number 0! is defined as 0! = 1.

Equation 1.1 is often called a binomial coefficient because it arises in
the expansion of the two terms $(p + q)^n$. To understand the reason why
Equation 1.1 yields the correct number of combinations of r As and (n - r)
Bs, first consider what the n! means. It is the total number of ways that any
set of n objects can be arranged in order. There are n ways to choose the first
object and, having chosen the first, n - 1 ways to choose the second and,
having chosen the first two, n - 2 ways to choose the third, and so on,
yielding $n \times (n-1) \times (n-2) \times \cdots \times 1 = n!$. Furthermore,
for each arrangement of n objects of which r are As and (n - r) are Bs, there
are r! ways to arrange the As among themselves and (n - r)! ways to arrange
the Bs among themselves, for a total of $r! \times (n-r)!$ arrangements.
Because each of the n! combinations of r As and (n - r) Bs includes
$r! \times (n-r)!$ equivalent arrangements of the As and Bs, the total number
of different arrangements of r As and (n - r) Bs equals the ratio given in
Equation 1.1.

Equation 1.1 gives the number of different arrangements of r As and (n - r)
Bs. Each arrangement has a probability given by $p^r q^{n-r}$. Therefore, using
the addition rule, the probability that n repeated trials yield r realizations
of A and (n - r) realizations of B equals

$$\frac{n!}{r!\,(n-r)!}\,p^{r} q^{n-r} \tag{1.2}$$

As an example of the use of Equation 1.2, consider the probability that a
sibship of 12 offspring from the mating Aa x Aa perfectly matches the
expected Mendelian ratio of 9 A- and 3 aa. In this case, p = 3/4, q = 1/4, n = 12,
r = 9, and n - r = 3. The required probability from Equation 1.2 is therefore

$$\frac{12!}{9!\,3!}\left(\frac{3}{4}\right)^{9}\left(\frac{1}{4}\right)^{3} = 220 \times 0.0751 \times 0.0156 = 0.258$$

The implication of this calculation is that, whereas the "expected" ratio is
9 A- : 3 aa, only a little more than 25% of such sibships actually have the
expected distribution.
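
The calculation above is easy to reproduce numerically. This is a minimal sketch of the repeated-trials probability, assuming Python 3.8 or later for `math.comb`; the function name `binomial_probability` is illustrative, not taken from the text.

```python
from math import comb


def binomial_probability(n, r, p):
    """Probability of exactly r realizations of A in n independent trials."""
    q = 1.0 - p
    # comb(n, r) is the binomial coefficient n! / (r! (n - r)!).
    return comb(n, r) * p**r * q**(n - r)


# Sibship of 12 from Aa x Aa with exactly 9 A- and 3 aa.
prob = binomial_probability(12, 9, 0.75)
print(comb(12, 9))     # 220 distinct birth orders
print(round(prob, 3))  # 0.258
```

Summing `binomial_probability(12, r, 0.75)` over r = 0, ..., 12 gives 1, as it must, because the terms are the complete expansion of $(p + q)^{12}$.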


PROBLEM 1.2 Suppose that a society decided to limit the number
of  males  by  passing  a law  denying  further  reproduction  to  any 
woman  who  gives  birth  to  a male  child.  Given  a ratio  of  males  to 
females  at  birth  of  1 : 1,  how  would  such  a law  affect  the  sex  ratio? 
Suppose  further  that,  in  practice,  any  woman  who  has  a female  child 
voluntarily  terminates  further  reproduction  with  probability  p.  In  this 
case,  what  is  the  proportion  of  males  in  sibships  of  size  n? 


ANSWER  The  law  would  have  no  effect  on  the  sex  ratio.  To  under- 
stand why,  consider  the  first  birth  across  the  entire  population.  The 
sex  ratio  among  these  offspring  must  be  50%  males.  Consider  now  the 
second  birth.  The  sex  ratio  among  these  offspring  must  also  be  50% 
males.  Indeed,  the  sex  ratio  in  any  birth  must  be  50%  males,  and  so 
this  is  the  sex  ratio  in  the  population  of  births  as  a whole.  In  regard  to 
the  second  part  of  the  problem,  note  that  sibships  of  size  n can  be  sep- 
arated into  two  classes:  those  in  which  the  final  birth  is  a male  (and 
the  mother's  further  reproduction  is  denied)  and  those  in  which  the 
final  birth  is  a girl  (in  which  the  mother  voluntarily  stops  reproducing 
with probability p). These types of sibships occur in the ratio 1/2 :
p/2, which means in the proportions 1/(1 + p) and p/(1 + p),
respectively. The first type of sibship has a proportion of males of 1/n and
the second has a proportion of males of 0. Hence, the proportion of
males as a function of sibship size equals (1/n) x [1/(1 + p)] + 0 x
[p/(1 + p)] = 1/[n(1 + p)]. Note that, for p = 0, the proportion of males as
a function of sibship size decreases according to the series 1, 1/2, 1/3,
1/4, .... Nevertheless, the sex ratio in the population as a whole equals
1/2 for this and any other value of p.


PHENOTYPIC DIVERSITY AND GENETIC VARIATION

One  of  the  universal  attributes  of  natural  populations  is  that  organisms  dif- 
fer in  phenotype  with  respect  to  many  traits.  Phenotypic  diversity  in  many 
traits  is  impressive  even  with  the  most  casual  observation.  Among  human 
beings,  for  example,  there  is  diversity  with  respect  to  height,  weight,  body 
conformation,  hair  color  and  texture,  skin  color,  eye  color,  and  many  other 
physical  and  psychological  attributes  or  skills.  Population  genetics  must  deal 
with  this  phenotypic  diversity,  and  especially  with  that  portion  of  the  diver- 
sity that  is  caused  by  differences  in  genotype.  In  particular,  the  field  of  pop- 
ulation genetics  has  set  for  itself  the  tasks  of  determining  how  much  genetic 
variation  exists  in  natural  populations  and  of  explaining  its  origin,  mainte- 
nance, and  evolutionary  importance.  Genetic  variation,  in  the  form  of  multi- 
ple alleles  of  many  genes,  exists  in  most  natural  populations.  In  most  sexu- 
ally reproducing  populations,  no  two  organisms  (barring  identical  twins  or 
other  multiple  identical  births)  can  be  expected  to  have  the  same  genotype 
for  all  genes.  Thus,  it  becomes  important  to  describe  how  alleles  in  natu- 
ral populations  are  organized  into  genotypes— to  determine,  for  example, 
whether  alleles  of  the  same  or  different  genes  are  associated  at  random. 

Allele  Frequencies  in  Populations 

Much  of  the  phenotypic  variation  in  natural  populations  does  not  yield  sim- 
ple Mendelian  segregation  ratios  such  as  1 : 1 or  3 : 1 in  pedigrees.  Some  dif- 
ferences in  phenotype  are  environmental  in  origin  and  so  are  not  expected  to 
show  Mendelian  segregation.  However,  simple  Mendelian  segregation  is  not 
usually  observed  even  for  traits  whose  expression  is  influenced  more  or  less 
strongly  by  genetic  factors.  Although  the  underlying  genetic  factors  do  seg- 
regate in  pedigrees  in  Mendelian  fashion,  the  segregation  is  concealed  by 
several  complications.  First,  environmental  effects  on  the  trait  may  be  strong 
enough  to  mask  the  genetic  segregation.  Second,  genetic  effects  on  many 
traits  are  determined  by  the  joint  effects  of  the  alleles  of  two  or  more  genes, 
and  the  segregation  of  any  one  gene  in  a pedigree  may  be  obscured  by  the 
segregation  of  others. 

On  the  other  hand,  some  phenotypic  diversity  in  populations  does  show 
simple  Mendelian  segregation.  In  the  snapdragon  Antirrhinum  majus,  for 
example,  whether  the  flower  color  is  red,  pink,  or  white  is  determined  by  the 
alleles  I and  i of  a single  gene.  The  genotypes  II,  Ii,  and  ii  have  red,  pink,  and 
white  flowers,  respectively,  an  example  of  incomplete  dominance. 

Populations  containing  both  the  I and  i alleles  will  include  plants  whose 
flowers  are  red  (II),  pink  (Ii),  or  white  (ii)  in  proportions  determined  by  the 
allele  frequencies  of  the  I and  i alleles  in  the  population  as  well  as  by  the 
manner  in  which  the  alleles  are  united  in  fertilization.  By  the  allele  frequen- 
cy of  a specified  allele,  we  mean  the  proportion  of  all  alleles  of  the  gene  that 
are of the specified type. To take a hypothetical example, suppose 400
members of a population were classified as to flower color and the finding
was:  165  red,  190  pink,  and  45  white.  Because  the  flower  color  reveals  the 
genotype,  we  may  infer  that  the  sample  of  400  includes  165  II,  190  Ii,  and  45  ii 
genotypes.  The  observed  numbers  of  I and  i alleles  are  therefore: 

I: 2 x 165 + 190 = 520
i: 190 + 2 x 45 = 280

The  factors  of  2 are  included  for  the  homozygous  genotypes  because  each 
II  genotype  contains  two  I alleles  and  each  ii  genotype  contains  two  i alleles. 
The  total  number  of  alleles  in  the  sample  equals  2 x 400  = 800.  Therefore,  if 
we  let  p represent  the  frequency  of  the  I allele  and  q represent  the  frequency 
of  the  i allele  (with  p + q = 1 because  these  are  the  only  alleles  of  the  gene  in 
question),  then  we  can  estimate  p and  q from  the  observations  as: 

$\hat{p}$ = 520/800 = 0.65

$\hat{q}$ = 280/800 = 0.35

Note that, if the I and i alleles were combined into genotypes at random,
the expected frequencies of the three genotypes can be calculated from the rule
for repeated trials by expanding the binomial
$(p\,I + q\,i)^2 = p^2\,II + 2pq\,Ii + q^2\,ii$.
Therefore, assuming random combination into genotypes, the expected
numbers of the three genotypes are:

II: (0.65)^2 x 400 = 169

Ii: 2 x 0.65 x 0.35 x 400 = 182

ii: (0.35)^2 x 400 = 49

Hence,  the  observed  numbers  in  this  hypothetical  population  are  very 
close to those expected with random combinations of alleles. The proportions
$p^2$, $2pq$, and $q^2$ for the three genotypes when two alleles are combined at
random constitute the Hardy-Weinberg principle, which is one of the basic
principles  in  population  genetics.  The  Hardy-Weinberg  principle  is  discussed 
in  detail  in  Chapter  2. 
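
The allele counting and the random-combination expectations can be sketched in a few lines. This is an illustrative sketch using the hypothetical snapdragon counts from the text; the function names are assumptions made for the example.

```python
def allele_frequencies(n_II, n_Ii, n_ii):
    """Estimate the frequencies of I and i from genotype counts."""
    total_alleles = 2 * (n_II + n_Ii + n_ii)
    # Each homozygote carries two copies of its allele, each heterozygote one.
    p_hat = (2 * n_II + n_Ii) / total_alleles
    q_hat = (n_Ii + 2 * n_ii) / total_alleles
    return p_hat, q_hat


def expected_counts(p_hat, q_hat, n_organisms):
    """Expected numbers of II, Ii, and ii under random combination of alleles."""
    return (p_hat**2 * n_organisms,
            2 * p_hat * q_hat * n_organisms,
            q_hat**2 * n_organisms)


p_hat, q_hat = allele_frequencies(165, 190, 45)
expected = expected_counts(p_hat, q_hat, 400)
print(round(p_hat, 2), round(q_hat, 2))  # 0.65 0.35
print([round(x) for x in expected])      # [169, 182, 49]
```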


PROBLEM 1.3 Suppose that a random sample of 400 snapdragons
from a population includes 185 red, 150 pink, and 65 white. Estimate
the allele frequency p of I and q of i. Assuming random combinations
of alleles in the genotypes, what are the expected numbers of
the three genotypes? Do the observed data seem to fit the
expectations?


ANSWER Among the total of 800 alleles, the observed number of I
alleles is 2 x 185 + 150 = 520 and that of i alleles is 150 + 2 x 65 = 280.
Therefore, $\hat{p}$ = 520/800 = 0.65 and $\hat{q}$ = 280/800 = 0.35. Note that the
estimated  allele  frequencies  are  the  same  as  above,  even  though  the 
observed  numbers  of  the  genotypes  are  different.  With  random  com- 
binations of  alleles  in  the  genotypes,  the  expected  numbers  are  again 
169  red,  182  pink,  and  49  white.  Compared  to  the  observations,  there 
appear  to  be  too  many  homozygous  genotypes  and  too  few  heterozy- 
gous genotypes.  (A  statistical  method  for  deciding  whether  the  fit  is 
satisfactory  or  not  is  discussed  in  Chapter  2.) 


Parameters  and  Estimates 

In the discussion of flower color in snapdragons, we made a subtle
distinction between the actual allele frequency of the I allele (designated p) and
the estimated allele frequency of the I allele (symbolized $\hat{p}$). The distinction
is necessary whenever an experimenter makes inferences about an entire
population from an examination of a random sample from the population.
Quantities used in describing entire populations are parameters. In the
snapdragon example, the parameter of interest is the allele frequency p of I in the
entire population. Because we only have access to a sample of 400 organisms
from the population, the true value of p is unknown. The best we can do is
make an estimate of p based on a sample, hoping that the sample is
representative of the population as a whole. The estimate obtained from the
sample is designated $\hat{p}$ to emphasize that it is an estimate rather than the true
value. In this book, whenever it is necessary to distinguish parameters from
their estimates, we use unembellished symbols for parameters (for example
p for the unknown frequency of an allele in a specified population) and the
same symbol with a circumflex for the estimated value (in this example $\hat{p}$).

The  Standard  Error  of  an  Estimate 

The  distinction  between  a parameter  and  an  estimate  is  important  because 
different  samples  may  yield  different  values  of  the  estimate  for  the  same  rea- 
son that  different  sibships  may  yield  different  segregation  ratios,  namely, 
chance variation from one repeated trial to the next. The estimation of an
allele frequency can be treated as repeated trials by supposing that the
alleles are sampled at random, one by one, from a very large population. In the
snapdragon example, there are 800 alleles sampled. If the allele frequency of
I has the true value p = 0.65, then the repeated-trials interpretation implies
that all possible outcomes of 800 trials have probabilities given by successive
terms in the expansion of $(0.65\,I + 0.35\,i)^{800}$. This is not an expansion that one
would  want  to  do  by  hand,  but  the  binomial  expression  makes  evident  the 
underlying  random-sampling  process  that  accounts  for  variation  in  the  esti- 
mate of  p from  one  sample  of  800  alleles  to  the  next. 

Unless p is quite close to 0 or quite close to 1, there is a convenient
approximation to the binomial expansion $(p\,I + q\,i)^n$, where n is the number of
alleles sampled. As n becomes large, the distribution of $\hat{p}$ approaches the
familiar bell-shaped curve called the normal distribution. The normal distribution
features prominently in the analysis of traits determined jointly by multiple
genetic and environmental factors and it is discussed in detail in that context
(Chapter 9). For present purposes, it is sufficient to note that the degree to
which the values of $\hat{p}$ are clustered around the overall average depends on a
quantity called the standard error:

$$s = \sqrt{\frac{pq}{n}} \tag{1.3}$$

where q = 1 - p. If the sampling and estimation of p were repeated many
times using the same population, then the values of $\hat{p}$ would be expected to
be clustered symmetrically around p according to the standard error as
follows:

• Approximately 68% of the estimates $\hat{p}$ lie within plus or minus one
standard error of p.

• Approximately 95% of the estimates $\hat{p}$ lie within two standard errors of p.

• Approximately 99.7% of the estimates $\hat{p}$ lie within three standard errors
of p.


To  put  the  matter  in  another  way,  with  repeated  sampling,  32%  of  the  esti- 
mates would  be  expected  to  differ  from  the  true  value  by  more  than  one  stan- 
dard error,  5%  by  more  than  two  standard  errors,  and  only  0.3%  by  more 
than  three  standard  errors. 

As an illustration of the variation among repeated estimates of p, Figure
1.6 shows the values of $\hat{p}$ obtained in 100 repetitions of the experiment of
sampling 800 alleles from a large population in which the true allele
frequency is p = 0.65. Each of the 100 samples was created by computer simulation
using a random-number generator that yielded a 1 with probability 0.65 and
a 0 with probability 0.35. For each sample of 800, therefore, the estimate $\hat{p}$
equals the number of 1s in the sample divided by 800. As is evident in
Figure 1.6, the distribution of $\hat{p}$ values is more or less bell-shaped but not
exactly so because it is based on only 100 samples rather than an infinite
number. The overall mean $\hat{p}$ from all 100 samples combined (80,000
observations) equals 0.6492, which is very close to the true value of p. Furthermore,
the distribution of the estimates fits the predictions based on the standard
error quite well.

Figure 1.6 Estimates of allele frequency based on 100 samples, each of size
400 diploid organisms, from a population in which the actual allele
frequency is 0.65. The standard error equals 0.017, and the distribution of the
estimates is very close to the bell-shaped distribution expected theoretically.
The scale across the top gives the ranges of the estimates as multiples of the
standard error.

To apply Equation 1.3 to the data in Figure 1.6, note first that p = 0.65
with n = 800, and so s in Equation 1.3 equals
$\sqrt{(0.65 \times 0.35)/800}$ = 0.017.
Because 68% of the samples are expected to yield values of $\hat{p}$ in the range p
± s, and because the expected distribution is symmetrical, 34 of the values in
Figure 1.6 are expected in the range p - s to p (0.633-0.650) and 34 in the
range p to p + s (0.650-0.667); the actual numbers are 33 in the first interval
and 35 in the second. By the same reasoning, 95% of the values should lie in
the range p ± 2s, or 47.5% on each side of the mean; because 34% of the values
on either side of the mean are in the range p ± s, the implication is that 47.5 - 34
or 13.5% of the values should lie in the range p - 2s to p - s and 13.5% should
lie in the range p + s to p + 2s. For the data in Figure 1.6, these ranges are
0.616-0.633 and 0.667-0.684; the actual numbers in these intervals are 18 and 10,
respectively, as against the theoretical 13.5 in each. Likewise, the standard
error predicts that 0.3% of the samples will deviate by more than 3s from the
mean, as compared with the observed 2.
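
A simulation of this kind is simple to repeat. The sketch below follows the description in the text (100 samples of 800 alleles, each allele scored 1 with probability 0.65), but the random-number generator and the seed are assumptions, so the counts it produces will differ from those in Figure 1.6.

```python
import random
from math import sqrt


def simulate_estimates(p, n_alleles, n_samples, seed=1):
    """Draw n_samples samples of n_alleles each; return the estimates of p."""
    rng = random.Random(seed)  # seed chosen arbitrarily for reproducibility
    estimates = []
    for _ in range(n_samples):
        # Each allele is a 1 (allele I) with probability p, otherwise a 0.
        ones = sum(1 for _ in range(n_alleles) if rng.random() < p)
        estimates.append(ones / n_alleles)
    return estimates


p_true, n = 0.65, 800
s = sqrt(p_true * (1 - p_true) / n)  # standard error of the estimate
estimates = simulate_estimates(p_true, n, 100)

mean_estimate = sum(estimates) / len(estimates)
within_1s = sum(abs(e - p_true) <= s for e in estimates)
within_2s = sum(abs(e - p_true) <= 2 * s for e in estimates)
print(f"mean estimate: {mean_estimate:.4f}")
print(f"within 1 s: {within_1s} of 100 (about 68 expected)")
print(f"within 2 s: {within_2s} of 100 (about 95 expected)")
```

Changing the seed gives a different set of 100 estimates, which is itself a small demonstration of the chance variation under discussion.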

Estimates and their standard errors are often presented as $\hat{p}$ ± s, or 0.65 ±
0.017 in the present example. The 68%, 95%, and 99.7% cutoffs for ± 1, ± 2,
and ± 3 standard errors provide one manner in which the reliability of an
estimate may be interpreted. Estimates may also be presented in
terms of a range called a confidence interval, which expresses a degree of
confidence that the true value of a parameter lies in some specified interval.
The most frequently encountered confidence interval is the 95% confidence
interval, defined as the interval ($\hat{p}$ - 2s, $\hat{p}$ + 2s). Because 95% of repeated
samples are expected to yield estimates in a range ± 2s around the true mean,
then 95% of the time the interval ($\hat{p}$ - 2s, $\hat{p}$ + 2s) is expected to include the
true value of the parameter p. In the snapdragon example with $\hat{p}$ = 0.65 and
s = 0.017, the 95% confidence interval is 0.616-0.684.
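
The standard error and the confidence interval can be computed directly from an estimate and the number of alleles sampled. This is a minimal sketch using the two-standard-error convention adopted in the text; the function names and the parameter `k` are illustrative.

```python
from math import sqrt


def standard_error(p_hat, n_alleles):
    """Standard error of an allele-frequency estimate (Equation 1.3)."""
    return sqrt(p_hat * (1.0 - p_hat) / n_alleles)


def confidence_interval(p_hat, n_alleles, k=2):
    """Interval of k standard errors on either side of the estimate."""
    s = standard_error(p_hat, n_alleles)
    return p_hat - k * s, p_hat + k * s


# Snapdragon example: estimate 0.65 from 800 sampled alleles.
s = standard_error(0.65, 800)
low, high = confidence_interval(0.65, 800)
print(round(s, 3))                    # 0.017
print(round(low, 3), round(high, 3))  # 0.616 0.684
```

Setting `k=1` or `k=3` gives the corresponding 68% and 99.7% intervals.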


PROBLEM 1.4 The MN blood groups in human beings are
mined by  two  alleles  of  a single  gene,  designated  M and  N.  Each  allele 
results  in  the  production  of  a different  type  of  polysaccharide  mole- 
cule on  the  surface  of  red  blood  cells,  which  can  be  distinguished  by 
means  of  appropriate  chemical  reagents.  The  types  of  molecules  cor- 
responding to  the  M and  N alleles  are  designated  M and  N,  respec- 
tively. The  M and  N alleles  are  codominant;  that  is,  genotype  MM 
produces  only  the  M substance  and  has  blood  group  M,  genotype  NN 
produces  only  the  N substance  and  has  blood  group  N,  and  the  het- 
erozygous genotype  MN  produces  both  the  M and  N substances  and 
has  blood  group  MN.  Among  a sample  of  1000  British  people  (Race 
and  Sanger  1975),  the  observed  numbers  of  each  blood  group  were 
298 M, 489 MN, and 213 N. Using these data, estimate the allele
frequency p of the M allele and calculate its standard error. What are the
68%, 95%, and 99.7% confidence intervals for p?


ANSWER Because each genotype has a unique phenotype, the
sample contains 2 x 298 + 489 = 1085 M alleles, and so $\hat{p}$ = 1085/2000
= 0.5425. The standard error s = $\sqrt{(0.5425)(1 - 0.5425)/2000}$ = 0.0111. The
68%, 95%, and 99.7% confidence intervals for p are $\hat{p}$ ± 1s, $\hat{p}$ ± 2s,
and $\hat{p}$ ± 3s, respectively, and so the confidence intervals are 0.5314-0.5536
(68%), 0.5202-0.5647 (95%), and 0.5092-0.5758 (99.7%).


MODELS IN POPULATION GENETICS

Population  geneticists  must  contend  with  factors  such  as  population  size, 
patterns  of  mating,  geographical  distribution  of  organisms,  mutation, 
migration,  and  natural  selection.  Although  we  wish  ultimately  to  under- 
stand the  combined  effects  of  all  these  factors  and  more,  the  factors  are  so 
numerous  and  interact  in  such  complex  ways  that  they  cannot  usually  be 
grasped  all  at  once.  Simpler  situations  are  therefore  devised,  situations  in 
which  a few  identifiable  factors  are  the  most  important  ones  and  others  can 
be  neglected.  An  intentional  simplification  of  a complex  situation  is  a 
model.  There  are  several  types  of  models,  each  designed  to  eliminate  extra- 
neous detail  in  order  to  focus  attention  on  the  essentials.  Some  models  are 
experimental.  An  experimental  model  may  consist  of  a laboratory  experi- 
ment with  population  cages  of  Drosophila  or  growing  cultures  of  bacteria. 
An  experimental  model  may  also  consist  of  observations  of  natural  popula- 
tions in  particular  locations  or  at  particular  times  in  which  evolutionary 
forces  of  interest  may  be  presumed  to  be  present.  Models  of  this  type 
include  the  study  of  the  origin  and  spread  of  insecticide  resistance  in  insects 
or  antibiotic  resistance  in  bacteria. 

A model  may  also  be  a conceptual  simplification.  Conceptual  models 
have  a number  of  uses.  They  require  a concise  statement  of  presumed  mech- 
anisms and  interactions;  they  afford  a framework  for  interpreting  observa- 
tions and  setting  research  priorities;  they  enable  extrapolation  into  the  future 
or beyond the range of known parameters; and they suggest tests of
consistency between theory and observation.

A conceptual model may consist of verbal arguments logically linking a
chain of hypotheses and deductions. Another type of conceptual model is a
computer  program  that  simulates  the  random  component  in  a process  or  that 
calculates  the  values  of  changing  quantities  in  a complex  system  based  on 
prescribed  numerical  relations.  An  example  of  a computer  model  is  the  one 
for  examining  the  result  of  repeated  random  sampling  whose  outcome  is 
depicted  in  Figure  1.6.  In  population  genetics,  a kind  of  model  frequently 
encountered  is  a mathematical  model,  which  is  a set  of  hypotheses  that  spec- 
ifies the  mathematical  relations  between  measured  or  measurable  quantities 
(the  parameters)  in  a system  or  process.  Mathematical  models  can  be 
extremely  useful: 

• They  express  concisely  the  hypothesized  quantitative  relationships 
between  parameters. 

• They  reveal  which  parameters  are  the  most  important  in  a system  and 
thereby  suggest  critical  experiments  or  observations. 

• They serve as guides to the collection, organization, and interpretation of
observed data.

• They make quantitative predictions about the behavior of a system that
can, within limits, be confirmed or shown to be false.

The  validity  of  any  model  must  be  tested  by  determining  whether  the 
hypotheses  on  which  it  is  based  and  the  predictions  that  grow  out  of  it  are 
consistent  with  observations. 

A mathematical  model  is  always  simpler  than  the  actual  situation  it  is 
designed  to  elucidate.  A model  is  supposed  to  be  simple:  If  it  is  not  simpler 
than  the  real  situation,  then  it  isn't  a model.  Models  are  simpler  than  real  sit- 
uations because  many  features  of  real  life  are  intentionally  ignored.  To 
include  every  aspect  of  a complex  system  would  make  a model  too  complex 
and  unwieldy.  Construction  of  a model  always  requires  a compromise 
between  realism  and  manageability.  A completely  realistic  model  is  likely  to 
be  too  complex  to  handle  mathematically,  and  a model  that  is  mathematical- 
ly simple  may  be  so  unrealistic  as  to  be  useless.  Ideally,  a model  should 
include  all  essential  features  of  the  system  and  exclude  all  nonessential  ones. 
How  good  or  useful  a model  is  often  depends  on  how  closely  this  ideal  is 
approximated.  In  short,  a model  is  a sort  of  metaphor  or  analogy.  Like  all 
analogies,  it  is  valid  only  within  certain  limits  but,  when  pushed  beyond 
these  limits,  becomes  misleading  or  even  absurd. 

In  this  book,  we  are  going  to  take  many  liberties  with  mathematical  rigor. 
Our  excuse  is  that  the  basic  ideas  of  a model  are  often  obscured  rather  than 
illuminated  by  excessive  attention  to  mathematical  detail.  Our  authority  for 
the  approach  is  the  great  physicist  Richard  Feynman,  who  wrote  in  one  of  his 
papers: 

Mathematicians  may  be  completely  repelled  by  the  liberties  taken  here.  The 
liberties  are  taken  not  because  the  mathematical  problems  are  considered 
unimportant.  On  the  contrary,  [I  hope]  to  encourage  the  study  of  these  forms 
from  a mathematical  standpoint.  In  the  meantime,  just  as  a poet  has  a license 
from  the  rules  of  grammar  and  pronunciation,  we  should  like  to  ask  for 
"physicists'  license"  from  the  rules  of  mathematics  in  order  to  express  what 
we  wish  to  say  in  as  simple  a manner  as  possible. 

Exponential  Population  Growth 

To illustrate the nature of mathematical models (as well as some of their
limitations) we consider the dynamics of population growth, a subject of
considerable interest in population genetics and population biology. In Figure
1.7, the solid dots show the increase in the number of cells of the yeast
Saccharomyces cerevisiae in a defined quantity of culture medium. The
number of cells increases slowly at first (hours 0-4), then more rapidly (hours
4-12), then more slowly again (hours 12-18).

Figure 1.7 Increase in the number of cells of the yeast Saccharomyces
cerevisiae in a defined quantity of culture medium (dots). The smooth curves are
made from mathematical models of exponential growth or logistic growth.
(Data from Pearl 1927.)

As a first approximation of the early stages of population growth, we may
assume that a constant fraction of the cells reproduces in each interval of
time. To simplify matters further, we will assume that the population size
does not change gradually but changes in a discrete and instantaneous "jump"
at the end of each hour. A model of this type is a discrete model of
population growth. Thus, we may write

$$N_t = N_{t-1} + rN_{t-1} \tag{1.4}$$

where $N_t$ and $N_{t-1}$ represent population size at the end of hours t and t - 1 and
where r is a constant called the intrinsic rate of increase equal to the fraction
of cells that reproduce in each interval of time. This equation says that the
population size at the end of hour t is the sum of two components: (1) all the
cells present at the end of hour t - 1 (which means that none of the cells die),
and (2) the progeny of the $rN_{t-1}$ cells that divided in the interval.

Equation  1.4  illustrates  a feature  of  theoretical  population  genetics  that 
sometimes  leads  to  confusion:  the  same  symbols  are  often  used  for  different 
things.  In  this  equation,  r is  the  intrinsic  rate  of  increase  in  population  number. 
In  other  equations  in  population  genetics,  r is  the  recombination  fraction 
between two genes linked in the same chromosome. The symbol r is used for
still other parameters also. Any possible confusion could be avoided by
indicating each parameter with a different letter; this solution is impractical
because one quickly runs out of letters, even including Greek letters. Another
way is to distinguish different meanings of the same letter by typography, the
use of superscripts, subscripts, and so forth. The problem with this approach
is that even simple equations get to look imposing. Still another solution, which
is the one adopted in this book, is to ask the reader to pay close attention to the
context so that, for example, r as used in the context of population growth is
not confused with r used in the context of genetic linkage and recombination.

The solution to Equation 1.4 is straightforward. Because $N_t = (1 + r)N_{t-1}$, it
follows that $N_{t-1} = (1 + r)N_{t-2}$. Consequently, we can write
$N_t = (1 + r)(1 + r)N_{t-2} = (1 + r)^2 N_{t-2}$. However, $N_{t-2} = (1 + r)N_{t-3}$, and so
$N_t = (1 + r)^3 N_{t-3}$. Continuing in this manner, we eventually deduce that

$N_t = (1 + r)^t N_0$    1.5

For  the  data  in  Figure  1.7,  if  we  set  N0  = 10  (the  observed  number)  and 
r = 0.7083,  the  first  few  points  from  Equation  1.5  (indicated  by  crosses)  fit 
very well: N0 = 10, N1 = 17, N2 = 29, N3 = 50. Then the model starts to break
down:  N4  = 85,  N5  = 145,  N6  = 249,  and  thereafter  the  fit  becomes  very  bad 
indeed.  The  lesson  from  this  example  is  that  many  models  have  a range  over 
which  they  are  reasonable  approximations  to  the  real  world,  in  this  case,  for 
a short  time  after  a yeast  culture  is  inoculated.  If  the  model  is  extrapolated 
beyond  its  range  of  validity,  it  yields  nonsense.  The  problem  for  many  mod- 
els in  population  genetics  is  that  their  range  of  validity  is  unknown. 
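
The recursion in Equation 1.4 and its closed form in Equation 1.5 can be checked against each other numerically. The Python sketch below is only an illustration: the function names `iterate_discrete` and `closed_form` are invented for this example, and the parameter values are the ones quoted above for Figure 1.7.

```python
def iterate_discrete(n0, r, steps):
    """Apply Equation 1.4, N_t = N_(t-1) + r * N_(t-1), step by step."""
    sizes = [n0]
    for _ in range(steps):
        previous = sizes[-1]
        sizes.append(previous + r * previous)
    return sizes


def closed_form(n0, r, t):
    """Equation 1.5: N_t = (1 + r)**t * N_0."""
    return (1 + r) ** t * n0


# Values quoted in the text for the yeast data: N0 = 10, r = 0.7083.
sizes = iterate_discrete(10, 0.7083, 6)
rounded = [round(n) for n in sizes]
print(rounded)
```

Rounded to whole cells, the iterated sizes should reproduce the sequence quoted in the text for hours 0 through 6, and each iterated value should agree with `closed_form` for the same t.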

In Equation 1.5, N is defined only for t equal to positive integers because
of  the  discrete  nature  of  the  model.  Population  growth  is  actually  a continu- 
ous process.  Population  size  increases  gradually  rather  than  in  jumps.  The 
continuous-growth  version  of  Equation  1.5,  shown  by  the  dashed  line  labeled 
"exponential  curve"  in  Figure  1.7,  is  given  by 

$N(t) = N(0)e^{r_0 t}$    1.6

where $r_0 = \ln(1 + r)$. The rationale for Equation 1.6 is based on the same sort
of argument as Equation 1.4 but compressing the time scale. Whereas
Equation 1.4 assumes that each unit of time is one hour, suppose that each
time unit were, say, one minute. In slowing down the time scale in this manner,
we must also decrease the value of r, otherwise too many organisms
would reproduce in each unit of time. Therefore, by analogy with Equation
1.4, we can write $N_t - N_{t-1} = r_0 N_{t-1}$, but here $r_0$ is the intrinsic rate of increase
in the new time scale. If N(t) is a smooth, continuous function and not
changing too fast, then it is easy to convince yourself that $N_t - N_{t-1}$ should
approximate the derivative of N(t), which is the change in N(t) in a small
interval of time, and that $N_{t-1}$ should be close to N(t) because we have
assumed that N(t) is not changing very fast in the new time scale. Therefore,
we can write


$\frac{dN(t)}{dt} = r_0 N(t)$    1.7

or

$\frac{dN(t)}{N(t)\,dt} = r_0$    1.8


Because $d\ln N(t)/dt = dN(t)/[N(t)\,dt]$, where ln denotes the natural logarithm,
the solution of Equation 1.8 is $\ln N(t) = r_0 t + C$, where C is a constant chosen
so that N(t) = N(0) when t = 0. (Hence, C = ln N(0).) Expressing the solution in
terms of N(t) rather than ln N(t) yields Equation 1.6. Furthermore, comparing
Equation 1.6 with Equation 1.5, it is clear that

$N(0)e^{r_0 t} = (1 + r)^t N_0$

and therefore $r_0 = \ln(1 + r)$ is the relation between the parameter $r_0$ in the
continuous model and the parameter r in the discrete model. Equation 1.6 is the
exponential function plotted in Figure 1.7 with N(0) = 10 and $r_0$ = 0.5355.
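
The relation between the discrete and continuous parameters can be confirmed with a few lines of Python. This is a minimal sketch under the parameter values quoted in the text; the helper name `continuous_size` is an assumption made for the example.

```python
import math


def continuous_size(n0, r0, t):
    """Equation 1.6: N(t) = N(0) * exp(r0 * t)."""
    return n0 * math.exp(r0 * t)


r = 0.7083              # discrete rate used for Figure 1.7
r0 = math.log(1 + r)    # matching continuous rate, r0 = ln(1 + r)
print(round(r0, 4))

# At whole-number times the two models give the same population size.
for t in range(4):
    discrete = (1 + r) ** t * 10
    continuous = continuous_size(10, r0, t)
    print(t, round(discrete, 3), round(continuous, 3))
```

The two columns agree at integer t; the continuous curve additionally assigns a population size to every time in between.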


PROBLEM 1.5 Under optimal culture conditions, the bacterium
Escherichia coli can double in population size every 20 minutes.
Because population growth is continuous, Equation 1.7 is the appropriate
model. A single cell of E. coli is cylindrical in shape and has a
volume of approximately $1.6\ \mu\mathrm{m}^3$ ($1.6 \times 10^{-12}\ \mathrm{cm}^3$). A standard soccer
ball has a diameter of 22 cm (roughly 9 inches) and a volume of
approximately $5600\ \mathrm{cm}^3$.

(a)  What  intrinsic  rate  of  increase  r0  per  minute  results  in  a dou- 
bling time  of  20  minutes? 

(b)  Starting  with  a single  cell  of  E.  coli  growing  under  optimal  con- 
ditions, how  long  would  it  take  to  produce  enough  cells  to  fill 
one  soccer  ball? 

(c)  How  many  soccer  balls  could  be  filled  with  cells  after  24  hours 
of  unrestricted  growth? 


ANSWER (a) Set $N(20) = 2N(0) = N(0)\exp(r_0 \times 20)$, where exp(·)
stands for $e^{(\cdot)}$. Therefore, $r_0 = (\ln 2)/20 = 0.034657$. (b) One soccer ball
full of cells equals $5600/(1.6 \times 10^{-12}) = 3.5 \times 10^{15}$ cells. The time needed
to produce this many cells is given by $t = [\ln(3.5 \times 10^{15})]/r_0 = 1032.7$
minutes (17.2 hours). (c) After 24 hours (1440 minutes) of unrestricted
growth, one cell yields $\exp(r_0 \times 1440) = 4.7 \times 10^{21}$ cells, which would
fill about 1.35 million soccer balls. (Note: If your answers to this
problem are a little different from those given, it is probably because
the numbers given were calculated to nine significant digits before
rounding off.)
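
The arithmetic in this answer is easy to reproduce. The following Python sketch uses only the quantities stated in the problem; the variable names are chosen for this illustration.

```python
import math

DOUBLING_TIME_MIN = 20.0     # minutes per doubling (from the problem)
CELL_VOLUME_CM3 = 1.6e-12    # volume of one E. coli cell
BALL_VOLUME_CM3 = 5600.0     # volume of one soccer ball

# (a) Doubling in 20 minutes means exp(r0 * 20) = 2.
r0 = math.log(2) / DOUBLING_TIME_MIN

# (b) Cells needed to fill one ball, and the time to grow them from one cell.
cells_per_ball = BALL_VOLUME_CM3 / CELL_VOLUME_CM3
minutes_to_fill = math.log(cells_per_ball) / r0

# (c) Cells after 24 hours, expressed as a number of soccer balls.
cells_after_day = math.exp(r0 * 24 * 60)
balls_after_day = cells_after_day / cells_per_ball

print(f"(a) r0 = {r0:.6f} per minute")
print(f"(b) {cells_per_ball:.2e} cells, {minutes_to_fill:.1f} minutes")
print(f"(c) {cells_after_day:.2e} cells, {balls_after_day:.3e} soccer balls")
```

Because 24 hours contains exactly 72 doubling periods, the cell count in part (c) is simply 2 raised to the 72nd power.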


Logistic  Population  Growth 

The  calculations  in  Problem  1.5  indicate  that  no  real  population  can  grow 
exponentially  for  more  than  a relatively  small  number  of  generations  with- 
out catastrophic  consequences.  In  nature,  although  factors  such  as  disease 
and  predation  often  contribute  to  the  control  of  population  size,  populations 
that  grow  too  large  ultimately  must  deplete  the  available  resources.  The  kind 
of  growth  curve  in  Figure  1.7  is  typical  for  populations  expanding  in  a new 
environment:  the  initial  population  growth  is  exponential,  but  then  the  rate 
of  growth  gradually  decreases. 

A simple  alternative  to  exponential  growth  is  the  logistic  model;  the  term 
logistic  refers  to  proportions  and,  in  the  logistic  model,  the  rate  of  population 
growth is assumed to decrease in proportion to the population size. By analogy
with Equation 1.4, the change in population size with a discrete model of
population  growth  takes  the  form 

$N_t = N_{t-1} + rN_{t-1}\left(\frac{K - N_{t-1}}{K}\right)$    1.10


In this equation, K is a constant known as the carrying capacity of the environment.
Observe that, when N is very small compared with K, then
$N_t \approx N_{t-1} + rN_{t-1}$, and so population growth is nearly exponential. On the other hand,
when N is close to K, then $N_t \approx N_{t-1}$, and so population growth comes to a
standstill.
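
Both limiting behaviors can be seen by iterating the discrete logistic recursion directly. The sketch below is illustrative only: the function names are invented, and the values N0 = 10, r = 0.5355, and K = 665 are borrowed as assumptions from the continuous fit discussed later in this section rather than being a fit of the discrete model itself.

```python
def logistic_step(n_prev, r, k):
    """One step of Equation 1.10: N_t = N_(t-1) + r*N_(t-1)*(K - N_(t-1))/K."""
    return n_prev + r * n_prev * (k - n_prev) / k


def iterate_logistic(n0, r, k, steps):
    """Return the list [N_0, N_1, ..., N_steps]."""
    sizes = [n0]
    for _ in range(steps):
        sizes.append(logistic_step(sizes[-1], r, k))
    return sizes


# Assumed illustrative parameters (not a fit of the discrete model).
sizes = iterate_logistic(n0=10.0, r=0.5355, k=665.0, steps=40)

early_ratio = sizes[1] / sizes[0]      # close to 1 + r while N is far below K
late_change = sizes[-1] - sizes[-2]    # close to 0 once N has reached K

print(round(early_ratio, 4), late_change)
```

With r below 1, as here, the iterated population rises steadily toward K without overshooting it; larger values of r in the discrete model can behave differently.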

Unlike Equation 1.4, Equation 1.10 does not have a simple solution for $N_t$
in terms of $N_0$. However, if the population grows sufficiently slowly, then
population  growth  can  be  treated  as  continuous,  and  Equation  1.10  yields  the 
differential  equation 


$\frac{dN(t)}{dt} = rN(t)\left(\frac{K - N(t)}{K}\right)$    1.11


The  solution  of  Equation  1.11  is  given  by 

$N(t) = \frac{K}{1 + Ce^{-rt}}$    1.12

where the constant $C = (K - N_0)/N_0$. Equation 1.12 is called the logistic
growth  curve  and  it  is  derived  in  Problem  1.7  below.  Logistic  population 
growth  results  in  a sort  of  S-shaped  curve  like  that  shown  in  Figure  1.7, 
where  the  parameters  are  r = 0.5355,  N0  = 10  and  K = 665.  (Note  that  the  r and 
N0  parameters  are  the  same  as  in  the  exponential-growth  model  for  the  same 
data.)  The  fit  is  obviously  very  good  indeed. 


PROBLEM 1.6 Use Equation 1.12 with N0 = 10, r = 0.5355, and K =
665  to  calculate  N(t)  for  the  times  t = 7 and  8 and  t = 13  and  14.  What 
are  the  values  of  r in  Equation  1.10  for  t = 8 and  t = 14?  Why  are  they 
not  equal  to  0.5355?  Why  are  they  not  equal  to  each  other? 


ANSWER With the given parameters, N(7) = 261.53, N(8) = 349.43,
N(13)  = 626.13,  and  N(14)  = 641.68.  Solving  Equation  1.10  for  r and 
substituting  N(t)  yields  r = 0.5540  for  t = 8 and  r = 0.425  for  t = 14.  Nei- 
ther of  these  values  agrees  with  r = 0.5355,  nor  do  they  agree  with 
each  other,  because  Equation  1.10  pertains  to  a discrete  model  and 
Equation  1.12  to  a continuous  model.  When  the  population  grows 
continuously,  the  value  of  r needed  to  produce  a given  change  in  pop- 
ulation size  in  some  discrete  interval  of  time  differs  according  to  the 
magnitude  of  the  change  in  population  size. 
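
The numbers in this answer can be reproduced with a short Python sketch. The function names `logistic_size` and `implied_discrete_r` are assumptions made for the illustration; the formulas are Equation 1.12 and Equation 1.10 rearranged for r.

```python
import math


def logistic_size(t, n0, r, k):
    """Equation 1.12: N(t) = K / (1 + C*exp(-r*t)), with C = (K - N0)/N0."""
    c = (k - n0) / n0
    return k / (1 + c * math.exp(-r * t))


def implied_discrete_r(n_prev, n_now, k):
    """Solve Equation 1.10 for r given two successive population sizes."""
    return (n_now - n_prev) / (n_prev * (k - n_prev) / k)


N0, R, K = 10.0, 0.5355, 665.0
for t in (8, 14):
    before = logistic_size(t - 1, N0, R, K)
    now = logistic_size(t, N0, R, K)
    r_discrete = implied_discrete_r(before, now, K)
    print(t, round(before, 2), round(now, 2), round(r_discrete, 4))
```

The implied discrete r differs between the two intervals because the continuous curve changes by a different relative amount in each one.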


PROBLEM 1.7 Use the expression $\int \frac{dx}{x(a + bx)} = -\frac{1}{a}\ln\frac{a + bx}{x}$
to derive the logistic growth curve from Equation 1.11.


ANSWER Write Equation 1.11 as $dN(t)/\{N(t)[K - N(t)]\} = (r/K)\,dt$ so that,
comparing with the integral form, it is clear that a = K and
b = -1. Integrating both sides in accordance with the formula results in
$-(1/K)\ln\{[K - N(t)]/N(t)\} = rt/K + \mathrm{cnst}$, where cnst is a constant of
integration chosen so that N(t) = N(0) when t = 0. Hence,
$\mathrm{cnst} = -(1/K)\ln\{[K - N(0)]/N(0)\} = -(1/K)\ln C$, where C is the constant appearing
in Equation 1.12. Consequently, $\ln\{[K - N(t)]/N(t)\} = -rt + \ln C$, and so
$[K - N(t)]/N(t) = C\exp(-rt)$. Equation 1.12 follows after some simplification.


SUMMARY

Population  genetics  is  the  application  of  Mendel's  laws  and  other  genetic 
principles  to  entire  populations  of  organisms.  It  includes  the  study  of  genet- 
ic variation  within  and  between  species  and  attempts  to  understand  the 
processes  resulting  in  adaptive  evolutionary  changes  in  species  through 
time.  Population  genetics  has  many  practical  applications  in  medicine,  agri- 
culture, conservation,  and  other  fields. 

A gene  is  a hereditary  determinant  transmitted  from  parent  to  offspring 
that  influences  a hereditary  trait,  often  in  combination  with  other  genes  and 
also  with  the  environment.  Alleles  are  alternative  forms  of  a gene.  Genotypes 
are  formed  from  pairs  of  alleles  and  are  either  homozygous  (if  the  alleles  in 
the  genotype  are  the  same)  or  heterozygous  (if  the  alleles  are  different).  The 
physical  or  biochemical  characteristics  of  an  organism  constitute  its  pheno- 
type. The  essential  mechanism  of  genetic  transmission  was  established  in 
experiments  by  Gregor  Mendel  in  the  years  1856  to  1863.  Mendel  showed 
that  the  alleles  of  each  gene  separate  (segregate)  from  one  another  in  the  for- 
mation of  reproductive  cells  or  gametes.  Genes  are  arranged  in  linear  order 
along  chromosomes.  A chromosome  may  contain  several  thousand  genes. 
Alleles  of  different  genes  present  in  the  same  chromosome  tend  to  be  inherit- 
ed together  (linkage),  but  the  allele  combinations  can  be  broken  up  by  recom- 
bination. 

Chemically,  a gene  is  a region  of  a DNA  molecule.  DNA  is  a metaphorical 
"twisted  ladder"  consisting  of  two  paired  strands  composed  of  polymers  of 
nucleotides  (the  sidepieces  of  the  ladder)  whose  bases  (either  A,  T,  G,  or  C) 
jut  inward  from  the  sidepieces  to  form  the  rungs.  Each  rung  of  the  ladder 
consists  of  either  an  A-T  base  pair  or  a G-C  base  pair.  Most  genes  code  for 
the  polypeptide  chains  of  proteins  through  a transcript  of  RNA  that  is 
processed  into  the  messenger  RNA  (mRNA).  The  polypeptide  is  produced 
stepwise  by  translation  of  the  mRNA  according  to  a triplet  genetic  code,  in 
which  each  nonoverlapping  group  of  three  adjacent  bases  (a  codon)  specifies 
the  amino  acid  to  be  attached  to  the  growing  chain.  Alleles  differ  in  their 
sequence  of  nucleotides.  A nucleotide  substitution  in  the  third  position  of  a 
codon may not result in an amino acid replacement in the encoded polypeptide
because  of  redundancy  in  the  genetic  code.  However,  most  nucleotide  substi- 
tutions in  either  of  the  first  two  positions  do  result  in  amino  acid  replace- 
ments. 

A probability  is  a number  between  0 and  1 that  measures  the  likelihood  of 
a particular  event  being  realized  in  an  actual  or  conceptual  experiment.  The 
addition  rule  applies  to  mutually  exclusive  events  and  states  that  the  proba- 
bility of  one  or  the  other  event  being  realized  equals  the  sum  of  the  separate 
probabilities.  The  multiplication  rule  applies  to  independent  events  and 
states  that  the  probability  of  both  events  being  realized  simultaneously 
equals the product of the separate probabilities. The probabilities of various
outcomes of repeated and independent trials can be deduced by application
of the addition and multiplication rules and conform to successive terms in
the binomial expansion $(p + q)^n$.

Natural  populations  contain  genetic  variation  in  the  form  of  multiple  alle- 
les of  many  genes.  For  any  specified  allele,  the  allele  frequency  is  the  propor- 
tion of  all  alleles  of  the  gene  that  are  of  the  specified  type.  The  allele 
frequency  in  a population  must  usually  be  estimated  from  a sample,  and  so 
there  is  variation  in  the  estimate  from  one  sample  to  the  next.  The  variation  is 
quantified  by  the  standard  error.  If  the  distribution  of  the  estimates  conforms 
to  a normal,  bell-shaped  distribution,  then  the  proportions  of  the  estimates 
lying  within  ± 1,  ± 2,  and  ± 3 standard  deviations  of  the  true  value  of  the 
parameter  are  68%,  95%,  and  99.7%,  respectively.  Estimates  are  also  often  pre- 
sented as  a confidence  interval,  which  expresses  the  degree  of  confidence  that 
the  true  value  of  a parameter  lies  in  some  specified  interval. 

A model  is  a deliberate  simplification  of  a complex  situation.  Models  may 
be  experimental  or  conceptual.  Conceptual  models  may  be  verbal,  computa- 
tional, or  mathematical.  Mathematical  models  are  widely  used  in  population 
genetics.  They  specify  the  mathematical  relations  between  measured  or  mea- 
surable quantities  that  determine  the  changes  in  allele  frequency  in  popula- 
tions. Population  growth  affords  an  example  of  mathematical  modeling.  In 
the  simplest  model  of  discrete  population  growth,  at  discrete  times  a constant 
fraction  of  the  population  reproduces,  and  so  the  population  jumps  instanta- 
neously from  one  size  to  the  next.  A more  realistic  model  envisages  continu- 
ous reproduction  through  time,  in  which  case  population  growth  is 
exponential.  The  exponential  model  often  fits  population  growth  in  newly 
colonized  environments  when  the  population  density  is  low.  Population 
growth  is  ultimately  limited  by  nutrients,  space,  or  other  resources.  When 
population  growth  decreases  in  proportion  to  population  size,  the  S-shaped 
logistic  curve  of  population  growth  results;  this  curve  is  determined  by  the 
intrinsic  rate  of  increase  r and  the  carrying  capacity  of  the  environment  K. 

PROBLEMS 

1.  If  you  were  to  catch  a collection  of  Drosophila,  grind  each  one  individual- 
ly in  a buffer  solution,  and  measure  the  rate  at  which  this  crude  whole-fly 
homogenate  catalyzed  the  reaction  for  glucose-6-phosphate  dehydroge- 
nase, you  would  find  that  the  activities  would  vary  by  more  than  four 
fold.  Make  a list  of  possible  causes  of  this  variation. 

2.  Given  the  complexity  of  causes  of  variation  in  Problem  1,  how  much  vari- 
ation would  you  expect  to  see  in  the  underlying  genetic  cause  of  a human 
inborn  error  of  metabolism  such  as  phenylketonuria?  This  disorder  is 
caused  by  insufficient  activity  of  phenylalanine  hydroxylase. 

3.  There  are  64  codons  in  the  genetic  code,  and  each  codon  can  undergo 
nine single-site mutations (each base can mutate to three other bases), for
a total of 576 mutations. How many of these result in no change in the
"meaning"  of  the  encoded  sequence? 

4.  Assuming  that  all  nucleotides  in  all  codons  mutate  with  equal  frequency 
(i.e.,  that  all  576  mutations  in  Problem  3 occur  at  the  same  rate),  are  muta- 
tions from  one  amino  acid  to  another  all  equally  likely? 

5.  The  correspondence  between  genotype  and  phenotype  is  one  of  the  most 
complex  and  difficult  aspects  of  evolutionary  genetics.  Describe  an  exam- 
ple of  a gene  whose  mutations  cause  more  than  one  distinctly  different 
phenotype  that  do  not  appear  to  be  related. 

6.  A population  cage  of  Drosophila  melanogaster  is  started  with  50  males  and 
50 females, all having the genotype (e st)/(e+ st+). This notation implies
that  one  chromosome  has  the  e and  st  mutations,  and  the  other  has  the 
wild  type  allele  at  both  loci.  These  two  loci  show  a frequency  of  recombi- 
nation in  females  of  r = 0.37,  and  the  males  produce  only  non-recombi- 
nant  gametes.  Calculate  the  expected  frequency  of  the  gametes  for  both 
males  and  females  and  the  expected  offspring  genotype  frequencies. 

7.  In  some  human  cultures  it  is  very  important  to  have  a son  and  a daugh- 
ter, and  couples  continue  having  offspring  until  they  have  one  of  each.  If 
an  entire  population  followed  this  rule,  what  would  happen  to  the  sex 
ratio  in  the  population? 

8.  If  two  genes  are  on  different  chromosomes,  the  probability  that  a gamete 
has  a particular  allele  of  each  of  the  two  genes  is  the  product  of  the  prob- 
ability of  drawing  each  allele  because  the  draws  are  independent  of  one 
another  (see  the  multiplication  rule).  If  each  gene  is  on  a different  chro- 
mosome, what  is  the  chance  that  genotype  Aa  Bb  CC  Dd  produces  two 
consecutive gametes that are ABCD?

9.  If  individual  X has  an  autosomal  recessive  disease  and  both  parents  are 
unaffected,  what  is  the  chance  that  the  sibling  of  X is  a heterozygous  car- 
rier? 

10.  A line  of  mice  seems  to  consistently  produce  55%  male  and  45%  female 
offspring.  In  order  to  test  whether  this  deviation  is  significant,  how  many 
offspring would you have to count to be able to reject a 50 : 50 sex ratio at
a probability of α = 0.05? (Assume that the sex ratio of the mice remains
55  : 45.) 

11.  A species  of  butterflies  occurs  in  two  distinct  morphs,  A and  B.  You  sam- 
ple two  areas  and  count  26  A and  28  B butterflies  in  one  area,  and  10  A 
and  21  B in  another  area.  Is  it  possible  that  these  two  samples  could  come 
from  a single  homogeneous  population,  or  are  the  frequencies  of  the  two 
morphs  significantly  different  from  one  another? 

12.  Levy  and  Levin  (1975)  used  electrophoresis  to  study  the  phosphoglucose 
isomerase-2  gene  in  the  evening  primrose  Oenothera  biennis,  a complex 
genomic heterozygote made true breeding by chromosomal translocations.
They observed two alleles affecting electrophoretic mobility of the
enzyme, and among 57 strains they found 35 PGI-2a/PGI-2a, 19 PGI-2a/PGI-2b,
and 3 PGI-2b/PGI-2b genotypes.

a. Calculate the allele frequencies of PGI-2a and PGI-2b.

b.  With  random  mating,  what  would  be  the  expected  numbers? 

13.  The  simple  models  of  population  growth  fail  to  take  into  account  many 
factors  that  affect  rates  of  change.  The  global  human  population  at  0 a.d., 
200  a.d.,  and  at  intervals  of  200  years  up  to  the  present  has  been  estimat- 
ed in  millions  of  people  as  200,  200,  200,  200,  250,  280,  350,  400,  550,  980, 
and  6000.  If  the  population  were  growing  exponentially,  these  points 
would  fall  on  a straight  line  when  plotted  on  a logarithmic  scale.  Draw 
this  plot.  What  do  you  conclude? 

14.  A healthy  pair  of  Drosophila  can  produce  500  offspring  in  12  days,  each 
adult  fly  weighing  about  1 mg.  Assume  that  the  parental  flies  die  after 
they  finish  reproducing.  (Actually,  they  live  about  a month.)  If  all  succes- 
sive generations  get  enough  to  eat  and  remain  this  fecund,  what  will  the 
mass of flies be in one year?


Genetic variation in populations became a subject of scientific
inquiry in the late nineteenth century prior even to the rediscovery
of Mendel's paper in 1900. The leading exponent of the study
of  hereditary  differences  among  human  beings  was  Francis  Galton 
(1822-1911).  Galton  was  a pioneer  in  the  application  of  statistics  to  biology. 
He  used  statistical  methods  to  study  physical  traits  such  as  eye  color  and  fin- 
gerprint ridges  as  well  as  behavioral  traits  such  as  temperament  and  musical 
ability.  Galton  was  among  the  first  to  examine  the  statistical  relations 
between  the  distributions  of  phenotypic  traits  in  successive  generations.  He 
is  regarded  as  the  founder  of  biometry,  the  application  of  statistics  to  biologi- 
cal problems. 


PHENOTYPIC  VARIATION  IN  NATURAL  POPULATIONS 

Galton  and  Mendel  exemplify  opposite  approaches  to  the  study  of  inherited 
traits.  Mendel's  point  of  departure  in  the  study  of  genetics  was  discrete  vari- 
ation, in  which  phenotypic  differences  among  organisms  can  be  assigned  to 
a small  number  of  clearly  distinct  classes,  such  as  round  versus  wrinkled 
peas. Galton's point of departure was continuous variation, in which the
phenotypes of organisms are measured on a quantitative scale, like height or
weight, and in which the phenotypes grade imperceptibly from one category
into the next. As material for the study of phenotypic variation, Galton's
choice was good: most of the differences among normal people that are visible
to the unaided eye are differences in continuous traits, such as height, weight,
skin color, hair color, facial features, running speed, and shoe size.
The  same  is  true  of  phenotypic  variation  in  other  organisms.  On  the  other 
hand,  as  material  for  the  study  of  genetic  variation,  Mendel's  choice  was 
good:  The  pattern  of  segregation  of  alleles  is  revealed  most  clearly  in  pedi- 
grees of  discrete,  simple  Mendelian  traits. 

Continuous  Variation:  The  Normal  Distribution 

With  continuous  traits,  not  only  do  the  phenotypes  grade  into  one  another, 
but  the  traits  also  usually  present  difficulties  for  genetic  analysis.  The  prob- 
lems are  of  two  principal  types: 

• Most  continuous  traits  are  influenced  by  the  alleles  of  two  or  more  genes, 
hence  the  segregation  of  any  one  gene  in  pedigrees  is  obscured  by  the 
segregation  of  other  genes  that  affect  the  trait. 

• Most  continuous  traits  are  influenced  by  environmental  factors  as  well  as 
by  genes,  and  so  genetic  segregation  is  obscured  by  environmental 
effects. 


These  problems  are  not  insurmountable  in  organisms  with  a sufficiently 
high  density  of  genetic  markers  scattered  throughout  the  genome  (the  com- 
plement of  chromosomes)  because  the  genetic  markers  can  be  tracked  in 
pedigrees  along  with  the  continuous  trait  of  interest.  Organisms  with  suffi- 
ciently dense  genetic  maps  include  human  beings,  laboratory  animals,  and 
many  domesticated  animals  and  crop  plants. 

In Galton's time, however, studies of continuous traits based on genetic
linkage  were  unknown.  Why,  then,  did  Galton  focus  on  continuous  traits? 


Because  they  have  a sort  of  regularity — a statistical  predictability — of  their 
own.  For  many  continuous  traits,  when  the  phenotypes  are  grouped  into 
suitable  intervals  and  plotted  as  a bar  graph,  the  distribution  of  phenotypes 
conforms  closely  to  the  normal  distribution,  the  symmetrical,  bell-shaped 
curve  discussed  briefly  in  Chapter  1 in  the  section  on  phenotypic  diversity 
and genetic variation. For example, a bar graph of Galton's data on the
heights  of  1329  men,  rounded  to  the  nearest  inch,  is  plotted  in  Figure  2.1. 
The  smooth  curve  is  the  normal  distribution  that  best  fits  the  data.  The  equa- 
tion of  the  normal  curve  is: 


$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/(2\sigma^2)}$    2.1

where x ranges from $-\infty$ to $+\infty$, and $\pi$ = 3.14159 and e = 2.71828 are constants.
The location of the peak of the distribution along the x axis is determined by
the parameter $\mu$, which is the mean, or average, of the phenotypic values.
The degree to which the phenotypes are clustered around the mean is determined
by the parameter $\sigma^2$, which is the variance of the distribution.
Mathematically, the variance is the average of the squared difference of each
phenotypic value from the mean; that is, it is the average of the values of
$(x - \mu)^2$. How $\mu$ and $\sigma^2$ are estimated from data is considered next.

Figure 2.1 Distribution of height among 1329 British men. Horizontal axis: height (rounded to the nearest inch). (Data from Galton 1889.)


Mean  and  Variance 

Because $\mu$ and $\sigma^2$ are parameters, their values are unknown, and they must
be estimated from the data themselves. The height data are tabulated in
Table 2.1, in which $f_i$ is the number of men whose height is $x_i$, rounded to the
nearest inch. (The fact that the shortest and tallest men are grouped in the
tails of the distribution makes no difference because these men account for
only a small proportion of the total sample.) Also tabulated are the products
$f_i \times x_i$ and $f_i \times x_i^2$ as well as their sums.

The mean $\mu$ of the distribution is estimated as the mean of the sample,
which is conventionally denoted $\bar{x}$ (also sometimes as $\hat{\mu}$):

$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$    2.2

In this example, $\bar{x}$ = 91,639/1329 = 68.95 inches.

Likewise, the variance $\sigma^2$ of the distribution is estimated as the variance of
the sample, which is conventionally denoted $s^2$ (also sometimes as $\hat{\sigma}^2$):

$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{\sum f_i x_i^2}{\sum f_i} - \bar{x}^2$    2.3

The expression in the middle follows directly from the definition of the variance:
it is the average of the squared deviations from the mean because, for
each value of $x_i$, $(x_i - \bar{x})$ is the deviation of that value from the mean. The
expression on the right is identical arithmetically but easier to apply in practice.
In the example in Table 2.1, $s^2 = 6{,}326{,}939/1329 - (91{,}639/1329)^2 = 6.11$.
(Use the unrounded mean here: the result is a small difference between two
large numbers, so rounding the mean to 68.95 before squaring changes the
answer noticeably. This value may still differ slightly from your own
calculation according to the number of significant digits you carried along
before rounding off.) If the sample
size is small (say, less than 50), then a slightly better estimate of the variance
is obtained by multiplying the expression in Equation 2.3 by n/(n - 1), where
n is the total size of the sample (in this case, 1329).
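
As a check on the arithmetic, the sample mean and variance can be computed directly from the grouped data. The Python sketch below copies the nearest-inch heights and counts from Table 2.1; the variable names are chosen for this illustration only.

```python
# Nearest-inch height x_i and number of men f_i, as tabulated in Table 2.1.
heights = [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]
counts = [23, 20, 64, 110, 155, 199, 203, 198, 171, 88, 47, 27, 24]

n = sum(counts)                                           # sum of f_i
sum_fx = sum(f * x for f, x in zip(counts, heights))      # sum of f_i * x_i
sum_fx2 = sum(f * x * x for f, x in zip(counts, heights)) # sum of f_i * x_i^2

mean = sum_fx / n

# Two arithmetically equivalent forms of the sample variance.
variance_direct = sum(f * (x - mean) ** 2 for f, x in zip(counts, heights)) / n
variance_shortcut = sum_fx2 / n - mean ** 2

# Small-sample correction: multiply by n / (n - 1).
variance_corrected = variance_shortcut * n / (n - 1)
std_dev = variance_shortcut ** 0.5

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(variance_shortcut, 2), round(std_dev, 2))
```

With a sample of 1329 the n/(n - 1) correction changes the variance only in the third decimal place, which is why it matters mainly for small samples.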

Closely  related  to  the  variance  is  the  standard  deviation  of  the  distribu- 
tion, which  is  the  square  root  of  the  variance.  The  standard  deviation  is  a nat- 
ural quantity  to  consider  in  view  of  the  units  of  measurement.  In  Table  2.1, 
for  example,  each  measurement  is  in  inches.  The  mean  is  also  in  inches.  How- 
ever, the  variance,  being  the  average  of  squared  deviations,  has  the  units  of 
squared  inches — which  seems  more  appropriate  for  an  area  than  for  a height. 
Taking  the  square  root  of  the  variance  restores  the  correct  unit  of  measure:  in 
this example, inches. The estimate of the standard deviation is conventionally
denoted s (also sometimes as $\hat{\sigma}$) and it is calculated as the square root of
the quantity in Equation 2.3. In the height example, s = 2.47 (which may
again differ slightly from your own calculation because of round-off error).
The estimate s of the standard deviation is often called the standard error.
When estimating a proportion, such as the frequency of an allele in a
population, the standard error is calculated according to Equation 1.3 in
Chapter 1.


TABLE 2.1  HEIGHTS OF 1329 MEN

Height        Height        Nearest       Number of
interval (i)  range (in.)   inch (x_i)    men (f_i)    f_i × x_i    f_i × x_i^2

 1            <63.5         63               23          1,449         91,287
 2            63.5-64.5     64               20          1,280         81,920
 3            64.5-65.5     65               64          4,160        270,400
 4            65.5-66.5     66              110          7,260        479,160
 5            66.5-67.5     67              155         10,385        695,795
 6            67.5-68.5     68              199         13,532        920,176
 7            68.5-69.5     69              203         14,007        966,483
 8            69.5-70.5     70              198         13,860        970,200
 9            70.5-71.5     71              171         12,141        862,011
10            71.5-72.5     72               88          6,336        456,192
11            72.5-73.5     73               47          3,431        250,463
12            73.5-74.5     74               27          1,998        147,852
13            >74.5         75               24          1,800        135,000

Totals                                    1,329         91,639      6,326,939
                                         (Σ f_i)     (Σ f_i x_i)   (Σ f_i x_i^2)

Source: Data from Galton 1889.

In Chapter 1, the values 68%, 95%, and 99.7% quoted as the proportions of
observations expected to fall within 1, 2, or 3 standard errors of the mean,
respectively, emerge directly from Equation 2.1 for the normal distribution. In
a normal distribution, the exact proportion of observations falling within any
specified range of x equals the integral of Equation 2.1 across the specified
range. For the normal distribution, the integral between the limits $\mu \pm \sigma$
equals 0.6827, that between $\mu \pm 2\sigma$ equals 0.9545, and that between $\mu \pm 3\sigma$
equals 0.9973. In data analysis, $\bar{x}$ and s are used in place of $\mu$ and $\sigma$.
Incidentally, the integral of the normal distribution between the limits $\mu \pm 4\sigma$ equals
0.9999; this result says that fewer than one in 10,000 observations falls more
than four standard deviations from the mean.
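
These integrals need not be taken on faith. For a normal distribution, the area between the mean minus k standard deviations and the mean plus k standard deviations equals erf(k/√2), where erf is the error function, a standard identity. The Python sketch below uses that identity; the function name `fraction_within` is an assumption made for the example.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 4):
    inside = fraction_within(k)
    print(k, round(inside, 4), f"{1 - inside:.2e}")
```

The last column is the fraction falling outside the interval; for k = 4 it is below one in 10,000, as stated above.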

Central  Limit  Theorem 

Galton  was  immensely  impressed  with  the  observation  that  many  natural 
phenomena  follow  the  normal  distribution.  He  writes: 

I know  of  scarcely  anything  so  apt  to  impress  the  imagination  as  the  wonder- 
ful form  of  cosmic  order  expressed  by  the  "law  of  frequency  of  error"  [the  nor- 
mal distribution].  Whenever  a large  sample  of  chaotic  elements  is  taken  in 
hand  and  marshaled  in  the  order  of  their  magnitude,  this  unexpected  and 
most  beautiful  form  of  regularity  proves  to  have  been  latent  all  along.  The  law 
would  have  been  personified  by  the  Greeks  if  they  had  known  of  it.  It  reigns 
with  serenity  and  complete  self-effacement  amidst  the  wildest  confusion.  The 
larger  the  mob  and  the  greater  the  apparent  anarchy,  the  more  perfect  is  its 
sway.  It  is  the  supreme  law  of  unreason. 

It  is,  indeed,  remarkable  to  consider  that  pure,  blind  chance  is  the  reason  for 
this  "unexpected  and  most  beautiful  form  of  regularity." 

The theoretical basis of the normal distribution is known in probability
theory as the central limit theorem. Roughly speaking, the central limit
theorem states that, under very general conditions, the sum of a large number
of independent random quantities converges to the normal distribution. For our
purposes, "independent" in this context means that information about any one
of the observations gives no improvement in the ability to predict any other of
the observations. A large number of independent random quantities is
apparently what Galton meant by "a large sample of chaotic elements." The central
limit theorem explains in part why so many continuously distributed traits
conform to the normal distribution. Most continuous traits are multifactorial,
meaning that they are influenced by "many factors," typically several or
many genes acting together with environmental factors. Among human
beings, for example, the obvious differences between normal people in hair
color, eye color, skin color, stature, weight, and other such traits are not
usually traceable to single genes. They result from the combined effects of
several or many genes as well as numerous environmental effects acting together
as "a large sample of chaotic elements," which often produce, in the
aggregate, a normal distribution of phenotypes.

It should be emphasized that the "large number" of random elements
specified in the central limit theorem need not be excessive. As an example,
Figure 2.2 is a bar graph of 100 observations in which each "observation"
consists of the sum of nine consecutive random numbers chosen with equal
probability from anywhere in the range (-1, +1). For the sum of nine random
numbers in this range, the theoretical mean equals 0 and the theoretical
standard deviation equals 1.73; the sample values were x̄ = -0.12 and s = 1.70.
The observations are grouped into categories according to their deviation
from the mean in multiples of the standard error, and the number of
observations in each category is shown at the top of the bar in Figure 2.2.
Because the expected numbers are 2.5, 13.5, 68, 13.5, and 2.5, the fit
to a normal distribution is obviously very good. In this example, therefore,
fewer than 10 "chaotic elements," when added together, yield "this
unexpected and most beautiful form of regularity."
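
The experiment behind Figure 2.2 is easy to repeat. The sketch below is an illustration under stated assumptions, not the procedure used to draw the figure: it uses Python's `random` module with an arbitrary seed, and the function names and the five categories (below -2, -2 to -1, -1 to +1, +1 to +2, above +2 standard errors) are assumptions based on the expected numbers quoted above. The theoretical standard deviation follows from the variance of a uniform number on (-1, +1), which is 1/3, so the sum of nine has variance 3.

```python
import math
import random
import statistics

# Variance of one uniform number on (-1, +1) is (1 - (-1))**2 / 12 = 1/3;
# the variance of a sum of nine independent ones is 9 * 1/3 = 3.
THEORETICAL_SD = math.sqrt(9 * (2.0 ** 2) / 12)


def simulate(n_obs=100, n_terms=9, seed=1):
    """Return n_obs sums, each of n_terms uniform random numbers on (-1, +1)."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(n_terms))
            for _ in range(n_obs)]


def bin_counts(values):
    """Count values in five categories of deviation from the sample mean,
    measured in multiples of the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    counts = [0, 0, 0, 0, 0]
    for v in values:
        z = (v - mean) / sd
        if z < -2:
            counts[0] += 1
        elif z < -1:
            counts[1] += 1
        elif z <= 1:
            counts[2] += 1
        elif z <= 2:
            counts[3] += 1
        else:
            counts[4] += 1
    return counts


sample = simulate()
print("sample mean:", round(statistics.mean(sample), 2))
print("sample sd:  ", round(statistics.stdev(sample), 2))
print("category counts:", bin_counts(sample))
```

Different seeds give different samples, so the counts will scatter around the expected 2.5, 13.5, 68, 13.5, and 2.5 rather than match them exactly.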


PROBLEM  2.1  At  an  International  Health  Exhibition  in  London  in 
1884,  Galton  set  up  an  "anthropometric  laboratory"  that  carried  out 
tens  of  thousands  of  measurements  covering  a wide  range  of  human 
traits. Among the traits was "strength of pull," expressed as the
number of pounds that a person could pull with one arm against a
resisting force in a sort of arm-wrestling contraption (Galton 1889). The
data  for  519  males  aged  23-26  years  fell  into  the  following  categories 
(the  number  in  parentheses  is  the  number  of  males  in  each  category): 
40-50  lbs  (10),  50-60  (42),  60-70  (140),  70-80  (168),  80-90  (113),  90-100 
(22),  100-110  (24).  Using  the  midpoint  of  each  category  as  the  strength 
of  pull  for  all  males  in  that  category,  estimate  the  mean  and  standard 
deviation  of  strength  of  pull.  Assuming  that  strength  of  pull  has  a 
normal  distribution  with  parameters  equal  to  these  estimates,  what  is 
the  expected  proportion  of  males  whose  strength  of  pull  exceeds  112 
pounds? 


Figure 2.2 Distribution of 100 values of the sum of nine random numbers
from the interval (-1, +1).


ANSWER The values of x_i are 45, 55, 65, and so forth. Then Σf_i = 519,
Σf_i x_i = 38,675, and Σf_i x_i² = 2,963,375. Hence, x̄ = 74.5 lbs, s² = 156.8 lbs²,
and so s = 12.5 lbs. (Answers may differ slightly because of round-off
error.) A strength of pull of 112 lbs is three standard errors above the
mean; hence a proportion of only (1 - 0.997)/2 = 0.0015 (about one in
667) males is expected to have a phenotype exceeding this value.
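
The arithmetic in this answer can be reproduced with a short script. This is a sketch under assumptions: the helper names are illustrative, and the `ddof` argument is introduced here only to show the effect of the divisor. Dividing the sum of squared deviations by n (`ddof=0`) gives a variance of about 156.8, whereas dividing by n - 1 (`ddof=1`) gives about 157.1. The tail proportion is computed from the normal integral rather than from the rounded 99.7% figure, so it comes out slightly below 0.0015.

```python
import math

# (category midpoint in lbs, number of males) from Problem 2.1
GROUPS = [(45, 10), (55, 42), (65, 140), (75, 168),
          (85, 113), (95, 22), (105, 24)]


def grouped_stats(groups, ddof=0):
    """Return n, sum of f*x, sum of f*x^2, mean, and variance for grouped data."""
    n = sum(f for _, f in groups)
    sum_x = sum(f * x for x, f in groups)
    sum_x2 = sum(f * x * x for x, f in groups)
    mean = sum_x / n
    variance = (sum_x2 - n * mean ** 2) / (n - ddof)
    return n, sum_x, sum_x2, mean, variance


def upper_tail(x, mean, sd):
    """Proportion of a normal distribution with the given mean and sd exceeding x."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


n, sum_x, sum_x2, mean, variance = grouped_stats(GROUPS)
sd = math.sqrt(variance)
print(n, sum_x, sum_x2)
print(f"mean = {mean:.1f} lbs, variance = {variance:.1f}, sd = {sd:.1f} lbs")
print(f"proportion exceeding 112 lbs = {upper_tail(112, mean, sd):.4f}")
```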


Discrete  Mendelian  Variation 

Discrete  Mendelian  variation  (also  called  simple  Mendelian  variation)  refers 
to  phenotypic  differences  resulting  from  segregation  of  the  alleles  of  a single 
gene. Environmental effects on the trait are small enough, relative to
hereditary differences, that the transmission of alleles determining the trait
can be traced through pedigrees. An example of discrete Mendelian variation is
the inheritance of red, pink, or white flower color in snapdragons (Chapter 1).
This case is exceptionally convenient for genetic studies because of the
intermediate phenotype of the heterozygote. However, most of the phenotypic
variation in natural populations is multifactorial. In human beings, for
example, although simple Mendelian variation accounts for many inherited
disorders, each of the disorders is relatively rare.

Ironically, simple Mendelian variation is more easily detected by studying
genes and their products than by studying phenotypes. Because the
mechanisms of transcription, RNA processing, and translation are relatively free of
the gene interactions and environmental effects that complicate the analysis
of multifactorial traits at the phenotypic level, there is a direct connection
between DNA sequences and alleles and a nearly direct connection between
genes and their products. Indeed, the correspondence between DNA
sequences and alleles is one-to-one: different alleles have different DNA
sequences irrespective of whether the alleles affect phenotype. Likewise,
alleles with nonsynonymous codon differences in a protein-coding region result
in different amino acid sequences irrespective of what the polypeptide does
in metabolism or how the difference in sequence affects the organism.

Hence, an efficient way to detect simple Mendelian variation is to study
molecules, and therein lies a paradox. As evolutionary biologists,
population geneticists are interested in observable phenotypes that are likely to be
subject to natural selection: morphology, rate of development, mating
behavior, age of reproduction, longevity, and so forth (in short, the types of traits
that attracted Galton). On the other hand, genetic studies are most readily
carried out with simple Mendelian variation detected as differences between
molecules. The paradox is that differences in molecules among healthy
organisms are not usually related in any obvious way to differences in
phenotype. Thus, there is a gap: we are unable to specify exactly which types of
molecular differences underlie the evolutionary process. The irony of the
situation is similar to that described by the physiologist Albert Szent-Györgyi:

My  own  scientific  life  was  a descent  from  higher  to  lower  dimensions,  led  by 
the desire to understand life. I went from animals to cells, from cells to
bacteria, from bacteria to molecules, from molecules to electrons. The story had its
irony,  for  molecules  and  electrons  have  no  life  at  all.  On  my  way,  life  ran  out 
between  my  fingers. 

The gap between genotype and phenotype results from the complex
interactions between genes and environment in the determination of physiology,
development, and behavior. In evolutionary biology, the complexity is even
greater because the key issue is the relative ability of organisms to survive and
reproduce in their environments. Nevertheless, the disconnect between
differences in molecules and evolutionary adaptations is by no means inevitable,
permanent, or insurmountable. It is already clear that the study of the relation
between  genetic  variation  and  evolutionary  adaptation  must  be  high  on  the 
agenda  of  evolutionary  biology  for  the  next  century,  and  already  there  are 
many  examples  in  which  the  relation  is  quite  well  established. 

EXPERIMENTAL  METHODS  FOR  DETECTING 
GENETIC  VARIATION 

For nearly 50 years, the workhorse method for revealing genetic variation
has been electrophoresis because small differences in rate of migration in an
electrophoretic field can be used to distinguish between nearly identical
macromolecules. A typical laboratory setup for electrophoresis is illustrated
schematically in Figure 2.3. The tray contains a thick layer of a gel, typically
starch, acrylamide, or agarose; it may be placed horizontally (as shown in the
illustration) or vertically (with the gel sandwiched between two glass plates).
Each sample of material is placed in a small slot near the edge of the gel.
Connected to each edge of the gel is a chamber containing a buffered
solution and electrodes. In electrophoresis, an electric current is applied across
the gel for several hours. Molecules in the samples (usually proteins or
nucleic acids are of greatest interest) move through the gel in response to
the electric field. Molecules of different size and charge move at different
rates. After the electrophoresis is finished, the positions of the molecule or
molecules of interest are revealed by any of several procedures.

Figure 2.3 One type of laboratory apparatus for electrophoresis. The
procedure is widely used to separate protein or DNA molecules. In conventional gels,
the distance migrated by DNA fragments smaller than about 20 kb decreases
approximately linearly with the logarithm of their molecular weights.

Protein  Electrophoresis 

In  protein  electrophoresis,  used  primarily  to  study  enzyme  molecules,  the 
position  to  which  a particular  enzyme  migrates  is  revealed  by  soaking  the 
gel  in  a solution  containing  a substrate  for  the  enzyme  along  with  a dye  that 
precipitates  where  the  enzyme-catalyzed  reaction  takes  place.  A dark  band 
thus  appears  in  the  gel  at  the  position  of  the  enzyme.  If  the  enzyme  present 
in  a sample  has  an  amino  acid  replacement  that  results  in  a difference  in  the 
overall  ionic  charge  of  the  molecule,  then  the  enzyme  will  have  a somewhat 
altered electrophoretic mobility and move at a different rate. The
electrophoretic mobility changes because enzymes of the same size and shape
move at a rate determined largely by the ratio of the number of positively
charged amino acids (primarily lysine, arginine, and histidine) to the
number of negatively charged ones (principally aspartic acid and glutamic acid).
Electrophoresis can therefore be used to detect a mutation that results in a
difference in electrophoretic mobility of the enzyme it encodes.

One  possible  result  of  an  electrophoresis  experiment  is  shown  in  the 
hypothetical  gel  in  Figure  2.4A,  in  which  all  samples  manifest  an  enzyme 
with  the  same  electrophoretic  mobility.  The  result  indicates  a monomorphic 
sample  because  there  is  only  one  electrophoretic  pattern  observed.  Another 
kind  of  result  is  shown  in  Figure  2.4B,  in  which  polymorphism  is  observed 
in  the  types  of  electrophoretic  patterns.  When  polymorphic  enzyme  bands 
are  observed,  genetic  tests  typically  indicate  that  organisms  with  only  a 
fast-migrating  enzyme  are  homozygous  for  a fast  allele  (F/F)  and  those 
with only a slow-migrating enzyme are homozygous for a slow allele (S/S).
Organisms with both enzyme bands are heterozygous for the alleles (F/S).
Simple Mendelian inheritance of the polymorphism is indicated by, for
example, the finding that matings of two heterozygotes produce, on the
average, 1/4 F/F, 1/2 F/S, and 1/4 S/S progeny. Two enzyme bands appear in
heterozygotes  whenever  the  active  enzyme  consists  of  a single  polypeptide 
chain  (rather  than  two  or  more  polypeptide  chains  aggregated  together) 
because  heterozygotes  produce  a different  polypeptide  chain  from  each 
allele. 
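
The 1/4 : 1/2 : 1/4 expectation for a mating of two heterozygotes, and the band pattern expected for each genotype of a single-polypeptide enzyme, can be enumerated directly. The sketch below is illustrative only; the function names `cross` and `bands` and the genotype notation as text strings are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected progeny genotype proportions when each parent transmits
    one of its two alleles with equal probability."""
    counts = Counter("/".join(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def bands(genotype):
    """Enzyme bands expected on the gel for a monomeric enzyme:
    one band per distinct allele in the genotype."""
    return sorted(set(genotype.split("/")))


progeny = cross(("F", "S"), ("F", "S"))
for genotype in ("F/F", "F/S", "S/S"):
    print(genotype, progeny[genotype], "bands:", bands(genotype))
```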

Enzymes that differ in electrophoretic mobility as a result of allelic
differences in a single gene are called allozymes. Hence, allozyme variation in a
population is an indication of simple Mendelian genetic variation. Allozyme
variation is widespread in almost all natural populations studied by
electrophoresis, including organisms such as bacteria, plants, Drosophila,
mice, and human beings.

Figure 2.4 Monomorphism and polymorphism. (A) Hypothetical gel showing
protein monomorphism. All samples have an enzyme with the same
electrophoretic mobility. (B) Hypothetical gel showing allozyme polymorphism.
Eight samples are homozygous for an allele (F) that codes for a rapidly
migrating enzyme; two samples are homozygous for a different allele (S) that codes for
a slowly migrating enzyme; and six samples are heterozygous (F/S) and
therefore exhibit enzyme bands corresponding to both alleles.


PROBLEM 2.2 A sample of 35 organisms from a Texas population of
the wild annual plant Phlox drummondii was examined for the
electrophoretic mobility of the enzyme alcohol dehydrogenase (Levin
1978). Two alleles affecting electrophoretic mobility were found,
Adh^a and Adh^b. The genotype frequencies observed in the sample
were 0.04 Adh^a/Adh^a, 0.32 Adh^a/Adh^b, and 0.64 Adh^b/Adh^b. Estimate
the allele frequency of Adh^a and its standard error.


ANSWER Let p represent the allele frequency of Adh^a. Then p = 0.04
+ 0.32/2 = 0.20. The standard error equals √[(0.20)(1 - 0.20)/(2 × 35)] =
0.05.
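
The same calculation in code, as a minimal sketch. The function names are illustrative, and the standard error is the binomial form used in the answer above, with the number of alleles sampled equal to twice the number of diploid organisms.

```python
import math


def allele_freq_from_genotypes(freq_homozygote, freq_heterozygote):
    """Allele frequency = homozygote frequency + half the heterozygote frequency."""
    return freq_homozygote + freq_heterozygote / 2.0


def binomial_se(p, n_alleles):
    """Standard error of an allele frequency estimated from n_alleles sampled alleles."""
    return math.sqrt(p * (1.0 - p) / n_alleles)


p = allele_freq_from_genotypes(0.04, 0.32)
se = binomial_se(p, 2 * 35)  # 35 diploid plants carry 70 alleles
print(f"p = {p:.2f}, standard error = {se:.2f}")
```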


PROBLEM 2.3 From a natural population of Drosophila melanogaster
in Raleigh, North Carolina, 660 fertilized females were trapped and
used to found a large laboratory population (Mukai et al. 1974). After
about five months (10 generations), 489 third chromosomes in the
population were examined for allozymes coding for the enzymes
esterase-6 (alleles E6^F and E6^S), esterase-C (alleles EC^F and EC^S), and
octanol dehydrogenase (alleles Odh^F and Odh^S). The order of the
genes in the third chromosome is known to be E6-EC-Odh. The
results were as follows:

Chromosome type         Number

E6^F  EC^F  Odh^F       152
E6^S  EC^F  Odh^F       264
E6^F  EC^F  Odh^S         7
E6^S  EC^F  Odh^S        13
E6^F  EC^S  Odh^F        15
E6^S  EC^S  Odh^F        29
E6^F  EC^S  Odh^S         1
E6^S  EC^S  Odh^S         8

Estimate the allele frequencies and their standard errors for E6^F and
E6^S, for EC^F and EC^S, and for Odh^F and Odh^S. What number of each of
the chromosome types is expected assuming that the alleles are
associated at random?

ANSWER For esterase-6, there were 175 E6^F and 314 E6^S alleles,
yielding p = 175/489 = 0.358 for E6^F and q = 314/489 = 0.642 for
E6^S; the standard error is the same for both estimates and equals
√[(0.358)(0.642)/489] = 0.022. For the other alleles, the estimates and
their standard errors are 0.892 ± 0.014 for EC^F and 0.108 ± 0.014 for
EC^S; and 0.941 ± 0.011 for Odh^F and 0.059 ± 0.011 for Odh^S. Assuming
random combinations, the expected number of each chromosome
type equals the product of the allele frequencies times 489. For
example, for E6^F EC^F Odh^F, the expected number is 0.358 × 0.892 × 0.941 ×
489 = 146.8. The expected numbers (observed in parentheses) for all
eight chromosome types are: 146.8 (152), 263.4 (264), 9.2 (7), 16.6 (13),
17.8 (15), 32.0 (29), 1.1 (1), 2.0 (8). The model of random combinations
of alleles fits very well.
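
The calculation generalizes to any table of chromosome-type counts. The following sketch, with illustrative names and with each chromosome type coded as a tuple of alleles in the order E6, EC, Odh, tallies the allele frequencies, their binomial standard errors, and the numbers expected if alleles at the three genes are combined at random.

```python
import math

# Observed counts of third chromosomes; alleles listed in the order (E6, EC, Odh)
COUNTS = {
    ("F", "F", "F"): 152, ("S", "F", "F"): 264,
    ("F", "F", "S"): 7,   ("S", "F", "S"): 13,
    ("F", "S", "F"): 15,  ("S", "S", "F"): 29,
    ("F", "S", "S"): 1,   ("S", "S", "S"): 8,
}
GENES = ("E6", "EC", "Odh")


def allele_freqs(counts):
    """Return the total count and, for each gene, the frequencies of F and S."""
    total = sum(counts.values())
    freqs = []
    for i in range(len(GENES)):
        f = sum(n for hap, n in counts.items() if hap[i] == "F") / total
        freqs.append({"F": f, "S": 1.0 - f})
    return total, freqs


def expected_counts(counts):
    """Expected number of each chromosome type under random association:
    product of the allele frequencies times the total."""
    total, freqs = allele_freqs(counts)
    return {hap: total * math.prod(freqs[i][a] for i, a in enumerate(hap))
            for hap in counts}


total, freqs = allele_freqs(COUNTS)
for gene, f in zip(GENES, freqs):
    se = math.sqrt(f["F"] * f["S"] / total)
    print(f"{gene}: F = {f['F']:.3f}, S = {f['S']:.3f}, standard error = {se:.3f}")
for hap, exp in expected_counts(COUNTS).items():
    print(hap, f"expected {exp:.1f}", f"observed {COUNTS[hap]}")
```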


The  Southern  Blot  Procedure 

Like  polypeptides,  DNA  fragments  can  be  separated  by  electrophoresis. 
Unlike a polypeptide, which has a predetermined size according to the
number of amino acids it contains, a molecule of chromosomal DNA is
randomly sheared into fragments of various sizes during purification. Therefore, in
any  DNA  preparation,  the  DNA  fragments  containing  a particular  sequence 
have  a range  of  sizes  depending  on  where  on  each  side  of  the  sequence  the 
chromosomal  DNA  became  sheared.  Fortunately,  there  is  a class  of  enzymes 
that  cleaves  DNA  at  particular  sites  along  the  molecule.  Consequently,  when 
chromosomal  DNA  is  cleaved  with  such  an  enzyme,  each  DNA  fragment 
containing  a particular  sequence  is  cut  at  the  same  sites  on  either  side  and  so 
will  have  the  same  length. 

The  enzymes  that  cleave  DNA  at  particular  sites  are  called  restriction 
enzymes.  Each  type  of  restriction  enzyme  cuts  double-stranded  DNA  at  all 
sites  at  which  there  is  a particular  nucleotide  sequence  called  the  restriction 
site  of  the  enzyme.  Examples  of  restriction  enzymes  and  their  restriction  sites 
are  shown  in  Figure  2.5;  the  cuts  are  made  at  the  positions  of  the  arrows.  For 
example, the enzyme AluI cuts at sites of the four-nucleotide sequence
AGCT, and EcoRI cuts at the six-nucleotide sequence GAATTC. Most
restriction enzymes used in population studies have either four-nucleotide or
six-nucleotide restriction sites.

Restriction enzyme      Restriction site

AluI                    5'-AG↓CT-3'
                        3'-TC↑GA-5'

HhaI                    5'-GCG↓C-3'
                        3'-C↑GCG-5'

HaeIII                  5'-GG↓CC-3'
                        3'-CC↑GG-5'

EcoRI                   5'-G↓AATTC-3'
                        3'-CTTAA↑G-5'

BamHI                   5'-G↓GATCC-3'
                        3'-CCTAG↑G-5'

XhoI                    5'-C↓TCGAG-3'
                        3'-GAGCT↑C-5'

Figure 2.5 Restriction enzymes cleave DNA molecules at sites of specific,
short nucleotide sequences. More than 500 different restriction enzymes are
commercially available. They are essential tools in DNA analysis and gene
cloning. The cleavage site in each DNA strand is indicated by the arrow.

DNA is also unlike an enzyme in that it lacks any catalytic activity that
can be used to determine the location of a band in a gel. On the other hand,
any single strand of DNA is able to form a double-stranded molecule by
pairing with another strand having the complementary base sequence. This
pairing of complementary DNA strands is the physical basis of the most
widely used procedure for identifying DNA fragments in a gel; the
procedure, illustrated in Figure 2.6, is a Southern blot. The reagent used for
identification is a molecule of DNA called the probe, which contains the
nucleotide sequence of interest. Probe DNA is usually obtained from a gene
that has been cloned (for example, into a bacterial cell) or by amplification
with the polymerase chain reaction (described in the next section). In the
Southern procedure, DNA restriction fragments that have been separated by
electrophoresis are rendered single-stranded by soaking in a solution of
sodium hydroxide, then blotted onto a nitrocellulose or nylon filter where
subsequent chemical treatment attaches them (Figure 2.6A). The filter is then
bathed in a solution containing probe DNA that has been rendered
radioactive (part B). As the solution cools, the probe DNA strands form
double-stranded molecules with their complementary counterparts on the filter, and
careful washing removes all of the probe DNA that has remained unpaired.
The filter is sandwiched with photographic film, where radioactive
disintegrations from the bound probe result in visible bands (part C). Alternatively,
the probe may be chemically modified and the bands visualized by
fluorescence or staining.

Figure 2.6 Southern blot procedure. (A) DNA fragments separated by
electrophoresis are transferred and chemically attached to a filter. (B) The filter is
mixed with radioactive probe DNA, which sticks to homologous DNA
molecules in the filter. (C) After washing, the filter is exposed to photographic film,
which develops dark bands caused by radioactive emissions from the probe.

Genetic differences resulting in the presence or absence of restriction sites
can be identified because they change the length of characteristic restriction
fragments. An example is illustrated in Figure 2.7. The upper part of each
panel shows the location of restriction sites in the DNA molecules in a diploid
genotype. The a-type molecule contains one additional restriction site not
present in the A-type molecule. The lower part of the figure demonstrates
that, with suitable probe DNA, all three genotypes can be distinguished by
their pattern of restriction fragments. A difference in the length of a
restriction fragment found segregating in natural populations is called a restriction
fragment length polymorphism or RFLP. Because RFLPs are widely
distributed throughout the genome of human beings and other organisms, they
have assumed major importance in population genetics.

Figure 2.7 Restriction fragment length polymorphisms (RFLPs) result from
the presence or absence of particular restriction sites in DNA. In this example,
the DNA molecule designated A contains three restriction sites, and the one
designated a contains four. Genotypes AA, Aa, and aa each yield a different pattern
of bands in Southern blot using the indicated probe DNA.
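
How a single site difference changes fragment lengths can be illustrated with a toy digest. Everything in the sketch below is hypothetical: the two short sequences are invented for the example (they have two and three EcoRI sites rather than the three and four of Figure 2.7), and the function names are illustrative. EcoRI cuts after the first nucleotide of GAATTC in the strand shown, so the cut offset is 1. A real Southern blot would reveal only those fragments that pair with the probe, which this sketch does not model.

```python
def cut_positions(seq, site, offset):
    """Positions in seq (index of the first base after each cut) at which an
    enzyme recognizing `site` cuts `offset` bases into the site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, site, offset):
    """Lengths of the fragments produced by complete digestion of a linear molecule."""
    cuts = [0] + cut_positions(seq, site, offset) + [len(seq)]
    return [b - a for a, b in zip(cuts, cuts[1:])]


ECORI_SITE, ECORI_OFFSET = "GAATTC", 1

# Hypothetical alleles: the middle site is GAGTTC (not cut) in A and GAATTC (cut) in a.
SEQ_UPPER_A = "C" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "C" * 15 + "GAATTC" + "C" * 5
SEQ_LOWER_A = "C" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "C" * 15 + "GAATTC" + "C" * 5

print("A:", fragment_lengths(SEQ_UPPER_A, ECORI_SITE, ECORI_OFFSET))
print("a:", fragment_lengths(SEQ_LOWER_A, ECORI_SITE, ECORI_OFFSET))
```

In this toy case the 47-nucleotide fragment of the A-type molecule is replaced by fragments of 26 and 21 nucleotides in the a-type molecule, which is the kind of length difference that defines an RFLP.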

The  Polymerase  Chain  Reaction 

The polymerase chain reaction (PCR) for the amplification of specific DNA
sequences is of great utility in population genetics for the production of
probe DNA or for the direct determination of the amount of nucleotide
sequence variation present in natural populations. The method is outlined in
Figure 2.8. The original DNA sequence to be amplified is shown in black and
the newly synthesized DNA strands in gray. The small ovals represent
synthetic oligonucleotides that are complementary in sequence to the ends of
the region to be amplified. The oligonucleotides are called primer sequences
because they anneal to the ends of the sequence to be amplified and are used
as primers for chain elongation by DNA polymerase. Primer
oligonucleotides are typically 18-22 nucleotides in length. DNA to be used as the
template in a PCR reaction is first mixed with both primers along with a
thermostable DNA polymerase in a buffer solution. The PCR amplification takes
place in cycles. In the first cycle, the DNA is heated to separate the strands
and then cooled in the presence of a vast excess of the primer
oligonucleotides. Then elongation of the primers produces double-stranded
molecules. The second cycle of PCR is similar to the first but, after the second
cycle, there are four copies of each original molecule. The cycle is repeated
from 20 to 30 times, each resulting in a doubling of the number of molecules.
The theoretical result of n rounds of amplification is 2^n copies of each
template molecule originally present.

Figure 2.8 The polymerase chain reaction (PCR). Short primer
oligonucleotides are used as primers to initiate DNA replication from opposite ends of a
DNA duplex to be amplified. After each round of replication, the DNA is heated
to separate the strands and then cooled to allow new primers to anneal.
Repeated rounds of replication result in an exponential increase in the number of
target molecules.
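
The theoretical doubling can be tabulated in a few lines. This sketch assumes perfect doubling in every cycle, as in the 2^n expression above; the function name is illustrative.

```python
def pcr_copies(initial_templates: int, cycles: int) -> int:
    """Theoretical number of copies after `cycles` rounds of perfect doubling."""
    return initial_templates * 2 ** cycles


# One template molecule after 2, 20, and 30 cycles
for n in (2, 20, 30):
    print(n, "cycles:", pcr_copies(1, n))
```

From a single template, 20 cycles give about one million copies and 30 cycles about one billion.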


PCR amplification is very useful in generating large quantities of a
specific DNA sequence without the need for cloning. The main limitation of the
technique  is  that  the  DNA  sequences  at  the  ends  of  the  region  to  be  amplified 
must  be  known  so  that  primer  oligonucleotides  can  be  synthesized.  There  are 
many  applications  in  which  this  requirement  is  met.  In  population  genetics, 
for  example,  PCR  can  be  used  to  amplify  different  alleles  present  in  natural 
populations. 


PROBLEM  2.4  PCR  was  used  to  amplify  five  alleles  (designated/-/) 
of  the  gene  Rh3  coding  for  a light-sensitive  protein  in  the  eye  of 
Drosophila simulans, a species of fruit fly closely related to D.
melanogaster. The resulting DNA fragments were sequenced (Ayala et
al. 1993). The data show the nucleotide present at each of 16
polymorphic nucleotide sites found in the first 500 nucleotide sites in the
amino acid coding region of the gene; the remaining 484 nucleotide
sites were monomorphic in this sample. Any nucleotide site that is an
exact multiple of three is at the third position of a codon. In this region
of the gene:

(a) what proportion of polymorphic nucleotide sites are in third
positions of codons? What can you infer from this observation?

(b) what proportion of nucleotide sites are polymorphic?

(c) why is the standard error formula not appropriate for the estimate
in part (b)?

ANSWER (a) Among the 16 polymorphic sites, only site 142 is not
an exact multiple of three, hence 15/16 = 94% of the polymorphic
sites are in the third codon position. The inference is that many of the
nucleotide polymorphisms are silent (synonymous) in that they do
not alter the amino acid sequence of the polypeptide. (In fact, all 16
are silent polymorphisms, including the C → T change in 142, which
alters the codon from CUA to UUA, both of which code for leucine.)
(b) A total of 16/500 = 3.2% of the nucleotide sites are polymorphic in
this region of the gene. (c) The binomial standard error is not
appropriate in this case because the nucleotides within a gene are not
independent samples; they are genetically closely linked. (A suitable
estimate of the standard error is given later in this chapter.)
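
The codon-position rule used in part (a) can be written as a one-line function. The sketch assumes, as the problem states, that sites are numbered from the first nucleotide of the coding region, so that multiples of three are third positions; the function name is illustrative. Only site 142 is checked individually, because it is the only one of the 16 polymorphic sites whose number is given in the answer.

```python
def codon_position(site: int) -> int:
    """Position (1, 2, or 3) within its codon of a site numbered from the
    first nucleotide of the coding region."""
    return (site - 1) % 3 + 1


print("site 142 is at codon position", codon_position(142))
print("proportion of polymorphic sites in third positions:", 15 / 16)
print("proportion of sites that are polymorphic:", 16 / 500)
```

Site 142 falls at the first position of its codon, which agrees with the CUA to UUA change described in the answer.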


POLYMORPHISM  AND  HETEROZYGOSITY 

Monomorphism or polymorphism of a gene in a sample is usually of interest
only insofar as it indicates monomorphism or polymorphism of the gene in the
population as a whole. In a population, a polymorphic gene is one for which
the most common allele has a frequency of less than 0.95 (some authors prefer
a more stringent cutoff at 0.99). Conversely, a monomorphic gene is one that is
not polymorphic. The cutoff at 0.95 (sometimes 0.99) in the definition of
polymorphism is arbitrary, but it serves to focus attention on those genes in which
allelic variation is common. In any large population, rare alleles are observed
for virtually every gene. An allele is considered a rare allele if its frequency is
less than 0.005; in human beings, between one and two people per thousand are
heterozygous for rare alleles of any gene. Many rare alleles are deleterious and
are presumably maintained in the population by recurrent mutation. The
definition of polymorphism is an attempt to focus on genes that have alleles with
frequencies too high to be explained solely by recurrent mutation to harmful
alleles. With the 0.95 definition of polymorphism given above, and if alleles are
combined at random into genotypes, then at least 9.5% of the population is
heterozygous for the most common allele (because 2 × 0.95 × 0.05 = 0.095).
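
The definition and the 9.5% figure can be expressed as two small functions. This is a sketch with illustrative names; `heterozygosity` assumes that alleles are combined at random into genotypes, in which case the proportion of heterozygotes is one minus the sum of the squared allele frequencies (equal to 2pq when there are two alleles).

```python
def is_polymorphic(allele_freqs, cutoff=0.95):
    """A gene is polymorphic if its most common allele has frequency below the cutoff."""
    return max(allele_freqs) < cutoff


def heterozygosity(allele_freqs):
    """Proportion of heterozygotes when alleles combine at random into genotypes."""
    return 1.0 - sum(p * p for p in allele_freqs)


print(is_polymorphic([0.94, 0.06]))               # most common allele below 0.95
print(is_polymorphic([0.96, 0.04]))               # fails the 0.95 criterion
print(is_polymorphic([0.96, 0.04], cutoff=0.99))  # passes the 0.99 criterion
print(round(heterozygosity([0.95, 0.05]), 3))     # the borderline case in the text
```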

Allozyme  Polymorphisms 

Polymorphism  of  alleles  that  determine  allozymes  is  extremely  widespread. 
Figure  2.9  summarizes  the  results  of  electrophoretic  surveys  of  14  to  71 
(mostly around 20) genes in populations of 243 species. Each point in the
figure (except that for human beings) gives the type of organism studied and
the  number  of  species  examined.  The  axis  labeled  Polymorphism  refers  to 
the  estimated  proportion  of  genes  that  are  polymorphic  by  the  0.95  criterion. 
The  axis  labeled  Heterozygosity  refers  to  the  average  heterozygosity  in  each 
group.  The  average  heterozygosity  is  the  estimated  proportion  of  genes 
expected  to  be  heterozygous  in  an  average  organism;  it  is  estimated  as  the 
proportion  of  heterozygous  genotypes  for  each  gene  averaged  over  all  genes. 
For  example,  the  data  for  Europeans  include  an  English  population  in  which 
10  enzyme  genes  were  examined  (Harris  1966).  Of  the  10  genes,  three  were 
found to be polymorphic, from which the estimated proportion of
polymorphic genes in the genome is 3/10 = 0.3. The observed proportion of
heterozygous genotypes for each of the three polymorphic genes was 0.509 (for
red-cell  acid  phosphatase),  0.385  (for  phosphoglucomutase),  and  0.095  (for 
adenylate  kinase);  the  average  heterozygosity  in  this  sample — taking  into 
account  the  additional  seven  genes  for  which  the  observed  heterozygosity 
was  0 — is  therefore  (0.509  + 0.385  + 0.095  + 7 x 0)/10  = 0.099. 
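
The two summary statistics in this example are simple averages over genes. A minimal sketch follows, using the Harris (1966) values quoted above; the function and variable names are hypothetical, and the polymorphic/monomorphic classification is taken as given rather than recomputed from allele frequencies.

```python
# Observed heterozygosity for each of the 10 enzyme genes in the English
# sample; the seven monomorphic genes contribute 0.
observed_heterozygosity = [0.509, 0.385, 0.095] + [0.0] * 7

# Classification of each gene by the 0.95 criterion, as reported in the text.
is_polymorphic = [True] * 3 + [False] * 7

def proportion_polymorphic(flags):
    """Fraction of surveyed genes classified as polymorphic."""
    return sum(flags) / len(flags)

def average_heterozygosity(per_gene):
    """Per-gene heterozygosity averaged over all surveyed genes,
    including the monomorphic ones."""
    return sum(per_gene) / len(per_gene)

P = proportion_polymorphic(is_polymorphic)
H = average_heterozygosity(observed_heterozygosity)
```

Here `P` is 3/10 and `H` is 0.989/10, which rounds to the 0.099 given in the text. Leaving the monomorphic genes out of the denominator would inflate the average heterozygosity.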

The  vertical  and  horizontal  bars  on  the  point  corresponding  to  Drosophila 
indicate  the  size  of  the  standard  error  of  the  estimate.  Therefore,  the  bars 
indicate  the  limits  of  polymorphism  and  heterozygosity  within  which  about 
68%  of  the  species  are  expected  to  fall.  Among  Drosophila  species,  approxi- 
mately 68%  have  a proportion  of  polymorphic  genes  in  the  range  0.30-0.56 
and  an  average  heterozygosity  in  the  range  0.09-0.19.  Such  bars  could  be 
attached  to  each  point;  their  lengths  would  be  comparable  to  those  for 
Drosophila,  indicating  substantial  variability  in  polymorphism  and  heterozy- 
gosity among  species  within  groups. 

Figure 2.9 Estimated levels of heterozygosity and proportion of polymorphic
genes derived from allozyme studies of various groups of plants and animals.
The number of species studied is shown in parentheses beside each point.
Squares denote averages for plants, invertebrates, and vertebrates. The bars
across the Drosophila point indicate the standard error within which about 68%
of the species are expected to fall. Other groups have similarly large standard
errors. (Data from Nevo 1978.)

Figure 2.9 has no simple summary because of the immense variability in
polymorphism and heterozygosity found within each group of organisms (as
indicated by the length of the variability bars corresponding to Drosophila).
On  the  whole,  there  is  a positive  relationship  between  amount  of  polymor- 
phism and  degree  of  heterozygosity.  This  relationship  is  as  expected  because 
the greater the fraction of polymorphic genes in a population, the more genes
are expected to be heterozygous on average. The overall mean polymorphism
in Figure 2.9 is 0.26 ± 0.15, and the mean heterozygosity is 0.07 ±
0.05.  Vertebrates  have  the  lowest  average  amount  of  genetic  variation  among 
the  groups  in  Figure  2.9,  plants  come  next,  and  invertebrates  have  the  high- 
est. Drosophila  is  the  most  genetically  variable  group  of  higher  organisms  so 
far  studied,  and  mammals  the  least  variable.  Human  beings  are  fairly  typical 
of  large  mammals:  An  extensive  electrophoretic  survey  of  104  genes  in  a sam- 
ple including  all  major  human  races  gave  estimates  of  polymorphism  of  0.32 
and  heterozygosity  of  0.06  (Harris  et  al.  1977).  The  one  obvious  conclusion 
that  can  be  reached  from  Figure  2.9  is  that  allozyme  polymorphisms  are 
widespread  among  higher  organisms.  Genetic  variation  is  even  more  preva- 
lent among  some  prokaryotes.  For  example,  natural  isolates  of  the  mam- 
malian intestinal  bacterium  Escherichia  coli  exhibit  levels  of  genetic 
polymorphism two or three times greater than those of vertebrates (Selander et al.
1987). 

Although genetic polymorphisms are widespread, they are not universal.
For example, both major subspecies of the cheetah Acinonyx jubatus are virtually
monomorphic (O'Brien et al. 1987). A survey of 49 enzymes among 30
animals from the East African subspecies (A. j. raineyi) yielded only two polymorphic
genes and estimates of polymorphism of 0.04 and heterozygosity of
0.01; among 98 animals from the South African subspecies (A. j. jubatus), the estimate
of polymorphism was 0.02 and that of heterozygosity 0.0004. Most
unusual  was  the  finding  of  skin-graft  acceptance  between  unrelated  cheetahs 
from  the  South  African  subspecies.  Graft  acceptance  means  that  the  cheetah 
population  is  monomorphic  for  the  major  histocompatibility  locus,  which  is 
abundantly  polymorphic  in  other  mammals.  Apparently,  the  cheetah,  which 
was worldwide in its range at one time but presently numbers fewer than
20,000  animals,  underwent  at  least  two  severe  constrictions  in  population 
number  resulting  in  the  loss  of  most  of  its  genetic  variability. 

How  Representative  Are  Allozymes? 

The  generality  of  estimates  of  polymorphism  based  on  electrophoresis  is 
somewhat  uncertain.  The  amount  of  polymorphism  may  be  underestimated 
because  conventional  electrophoresis  fails  to  detect  many  amino  acid 
replacements.  For  example,  in  a study  of  14  myoglobin  proteins  from  various 
species  including  cetaceans  (whales,  dolphins  and  porpoises),  no  more  than 
eight  could  be  distinguished  by  conventional  electrophoresis;  however,  13 
could  be  distinguished  by  varying  the  pH  value  of  the  electrophoresis  buffer 
(McLellan  and  Inouye  1986).  Some  amino  acid  replacements  can  be  detected 
because  they  render  the  enzyme  sensitive  to  high  temperatures;  a test  for 
temperature  sensitivity  increased  the  number  of  identified  alleles  of  the  gene 
coding  for  xanthine  dehydrogenase  in  Drosophila  pseudoobscura  from  6 to  37 
and  increased  the  estimate  of  average  heterozygosity  from  0.44  to  0.73  (Singh 
et  al.  1976).  On  the  other  hand,  although  more  elaborate  techniques  reveal 
additional  alleles  of  genes  known  to  be  polymorphic,  thus  increasing  esti- 
mates of  heterozygosity,  genes  classified  as  monomorphic  by  means  of  rou- 
tine electrophoresis  tend  to  remain  monomorphic,  and  so  estimates  of  poly- 
morphism remain  much  the  same  as  before. 

Electrophoretic  surveys  might  also  overestimate  the  amount  of  polymor- 
phism because  the  enzymes  typically  surveyed  are  those  found  in  relatively 
high  concentration  in  tissues  or  body  fluids  ("Group  I enzymes")  and  often 
lack  the  high  substrate  specificity  of  enzymes  implicated  in  central  metabol- 
ic processes  ("Group  II  enzymes").  For  example,  among  10  Group  I and  11 
Group  II  enzymes  in  Drosophila,  estimates  of  polymorphism  and  heterozy- 
gosity were  0.70  and  0.24  in  the  former  and  0.27  and  0.04  in  the  latter  (Gilles- 
pie and  Langley  1974).  In  summary,  protein  electrophoresis  is  a convenient 
method  for  detecting  polymorphisms,  but  it  is  difficult  to  extrapolate  from 
electrophoretic surveys of enzymes to the entire genome because the
enzymes  may  not  be  representative. 

Polymorphisms  in  DNA  Sequences 

One  inevitable  limitation  of  protein  electrophoresis  is  the  inability  to  detect 
variation  in  a nucleotide  sequence  that  does  not  alter  the  amino  acid 
sequence.  A polymorphism  is  silent  if  it  is  present  in  the  coding  region  but 
does  not  alter  the  amino  acid  sequence;  many  nucleotide  differences  in  third- 
codon  position  are  of  this  type.  A polymorphism  is  noncoding  if  it  affects 
nucleotides  in  noncoding  regions  such  as  the  upstream  region,  the  down- 
stream region,  or  introns.  Silent  and  noncoding  polymorphisms  may  have 
subtle  effects  on  the  organism,  and  the  alleles  may  be  affected  by  natural 
selection;  the  polymorphic  alleles  are  silent  or  noncoding  only  in  the  sense 
that  they  all  code  for  the  same  amino  acid  sequence.  An  example  of  exten- 
sive silent  polymorphism  in  Drosophila  is  illustrated  in  Figure  2.10  for  alleles 
of  the  gene  coding  for  alcohol  dehydrogenase.  This  gene  has  an  elec- 
trophoretic polymorphism  that  is  widespread  in  natural  populations  with 
two predominant alleles, slow (Adh-S) and fast (Adh-F). The molecular difference
is that, in the fourth and last exon of the gene, the codon for amino
acid  number  193  in  Adh-S  is  AAG  (lysine)  and  in  Adh-F  is  ACG  (threonine). 
The  enzymes  differ  not  only  in  electrophoretic  mobility.  The  product  of  the 
fast  allele  has  a greater  enzymatic  activity  and  is  also  synthesized  in  greater 
amount  than  that  of  the  slow  allele. 

The  data  in  Figure  2.10  are  derived  from  studies  of  RFLPs  in  the  Adh 
region  of  1533  flies  isolated  from  25  populations  throughout  eastern  North 
America  (Berry  and  Kreitman  1993).  A total  of  113  haplotypes  were  identi- 
fied. A haplotype  is  a unique  combination  of  genetic  markers  present  in  a 
chromosome.  In  Figure  2.10,  the  haplotypes  indicated  with  squares  are  Adh-F 
and  those  with  circles  are  Adh-S.  The  number  inside  each  symbol  is  the  rela- 
tive abundance  of  the  haplotype  (1  being  the  most  frequent,  2 the  next  most 
frequent,  and  so  forth).  A straight  line  connecting  two  haplotypes  indicates 
that  they  differ  by  a single  change.  Figure  2.10  includes  93  haplotypes  related 
to at least one other by a single change; the other 20 haplotypes observed in
the  study  include  additional  changes.  The  main  point  of  the  Adh  example  is 
that  natural  populations  contain  a great  abundance  of  different  types  of 
nucleotide-sequence  variation  that  does  not  affect  amino  acid  sequence. 

Nucleotide  Polymorphism  and  Nucleotide  Diversity 

Figure 2.10 Haplotypes of alleles in the Adh region of Drosophila melanogaster
from the East Coast of North America. Each line in the network connects two
haplotypes differing by a single molecular difference. An additional 20 haplotypes,
differing by more than one change from those in the network, are not
shown. Squares indicate the Adh-F allele, circles the Adh-S allele. (From Berry
and Kreitman 1993.)

Sequence data can be used quantitatively to estimate the level of genetic variation
at the nucleotide level. The data in Problem 2.4 are typical and so will
be used to exemplify the calculations. The level of nucleotide polymorphism,
symbolized $\theta$, is the proportion of nucleotide sites that are expected
to be polymorphic in any suitable sample from this region of the genome.
The estimate $\hat{\theta}$ equals the proportion of nucleotide polymorphism observed
in the sample, often symbolized as S, divided by

$$a_1 = \sum_{i=1}^{n-1} \frac{1}{i} \qquad (2.4)$$

where n is the size of the sample. In this case, S = 16/500 = 0.032 for a sample
of size n = 5, so that $a_1$ = 1/1 + 1/2 + 1/3 + 1/4 = 2.083. The estimate of
$\theta$, per nucleotide site, is therefore

$$\hat{\theta} = \frac{S}{a_1} = \frac{0.032}{2.083} = 0.015 \qquad (2.5)$$


As noted in Problem 2.4, the variance of $\hat{\theta}$ is not binomial because, owing
to genetic linkage, successive nucleotides cannot be regarded as realizations
of independent trials. An approximation to the variance can be derived under
the assumption that the nucleotides at a site are functionally equivalent or
invisible to natural selection; the mathematical details are beyond the scope
of this book, but the result is quite simple. The variance of $\hat{\theta}$, per nucleotide
site, is given by

$$\mathrm{Var}(\hat{\theta}) = \frac{\theta}{k a_1} + \frac{a_2 \theta^2}{a_1^2} \qquad (2.6)$$

where $a_1$ is as defined in Equation 2.4, k is the number of nucleotides in each
sequence (in our example, k = 500), and $a_2$ is a function of the number of alleles
n in the sample, namely

$$a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2} \qquad (2.7)$$

For n = 2 through 10, the values of $a_2$ are 1, 1.25, 1.36, 1.42, 1.46, 1.49, 1.51,
1.53, 1.54. In the case at hand, n = 5 and the estimated variance of $\hat{\theta}$ is
$0.015/(500 \times 2.083) + 1.42 \times 0.015^2/2.083^2 = 9.2131 \times 10^{-5}$ (the arithmetic
is carried out with unrounded values of $\hat{\theta}$, $a_1$, and $a_2$). The standard error
of $\hat{\theta}$ is the square root of the variance or, in this case, 0.0096 per nucleotide
site.
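
The calculation in Equations 2.4 through 2.7 can be sketched in a few lines of Python. The function names are hypothetical, and the inputs are the Problem 2.4 values quoted above (16 polymorphic sites among k = 500 nucleotides in a sample of n = 5 sequences).

```python
import math

def harmonic_sums(n):
    """Return (a1, a2) of Equations 2.4 and 2.7 for a sample of n sequences."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    return a1, a2

def theta_estimate(num_polymorphic_sites, k, n):
    """Estimate theta per nucleotide site (Equation 2.5), with its variance
    (Equation 2.6, evaluated at the estimate) and standard error."""
    S = num_polymorphic_sites / k      # proportion of polymorphic sites
    a1, a2 = harmonic_sums(n)
    theta = S / a1
    variance = theta / (k * a1) + a2 * theta**2 / a1**2
    return theta, variance, math.sqrt(variance)

theta_hat, var_theta, se_theta = theta_estimate(16, k=500, n=5)
```

No intermediate value is rounded, so the variance matches the 9.2131 × 10⁻⁵ quoted in the text rather than the slightly smaller number obtained by substituting the rounded figures 0.015, 2.083, and 1.42.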

A second quantity used to assess polymorphisms at the DNA level is the
nucleotide diversity, typically denoted $\pi$, which is the average proportion of
nucleotide differences between all possible pairs of sequences in the sample.
In a sample of n sequences, there are n(n - 1)/2 pairwise comparisons. For
the data in Problem 2.4, n = 5, and so there are 10 pairwise comparisons. The
pairwise comparisons may be considered for each nucleotide in turn and the
differences averaged later. For the polymorphic sites in Problem 2.4, the number
of pairwise differences is 6 (= 2 × 3) for sites 132, 142, 246, 351, 405, and
483; it is 4 (= 1 × 4) for sites 162, 198, 201, 207, 240, 354, 372, 375, and 417; and
it is 7 for site 192. Among the 484 monomorphic nucleotides in Problem 2.4,
the number of pairwise differences is 0. The average proportion of pairwise
differences between the sequences in the sample is the estimate, $\hat{\pi}$, of the
nucleotide diversity; hence,

$$\hat{\pi} = \frac{6 \times 6 + 4 \times 9 + 7 \times 1 + 0 \times 484}{10 \times 500} = 0.016$$
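
The site-by-site counting can be sketched as follows. The function names are hypothetical. The allele counts per site are inferred from the pairwise differences quoted above, not read from the Problem 2.4 data: 2 and 3 copies give 6 differences, 1 and 4 copies give 4, and 7 differences in a sample of five can only arise from three bases present in 1, 1, and 3 copies.

```python
from itertools import combinations

def pairwise_differences(allele_counts):
    """Number of sequence pairs that differ at one site, given how many
    sequences in the sample carry each base at that site."""
    return sum(a * b for a, b in combinations(allele_counts, 2))

def nucleotide_diversity(polymorphic_site_counts, k, n):
    """Average proportion of pairwise differences per nucleotide site.
    Monomorphic sites contribute zero differences, so only polymorphic
    sites need to be listed; k is the total sequence length."""
    n_pairs = n * (n - 1) // 2
    total = sum(pairwise_differences(c) for c in polymorphic_site_counts)
    return total / (n_pairs * k)

# Six sites split 2:3, nine sites split 1:4, and one three-base site (192).
sites = [(2, 3)] * 6 + [(1, 4)] * 9 + [(1, 1, 3)]
pi_hat = nucleotide_diversity(sites, k=500, n=5)
```

The unrounded result is 79/5000 = 0.0158, which the text reports as 0.016.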


The variance of $\hat{\pi}$ is estimated as follows:

$$\mathrm{Var}(\hat{\pi}) = \frac{b_1}{k}\pi + b_2 \pi^2 \qquad (2.8)$$

where k is again the length of the sequences in nucleotides and where

$$b_1 = \frac{n+1}{3(n-1)} \qquad (2.9)$$

$$b_2 = \frac{2(n^2+n+3)}{9n(n-1)} \qquad (2.10)$$

For example, when n = 5, then $b_1$ = 0.5 and $b_2$ = 0.37, and so $\mathrm{Var}(\hat{\pi})$ =
$(0.5/500) \times 0.016 + 0.37 \times 0.016^2 = 0.000107$ (using unrounded values of $b_2$
and $\hat{\pi}$); the standard error of $\hat{\pi}$ is the square root, or 0.010.

The estimates of $\theta$ and $\pi$ based on nucleotide sequences are not readily
convertible to levels of polymorphism and heterozygosity expected at the
protein level. The main reason is that most observed nucleotide polymorphisms
are either silent or noncoding and so do not change the amino acid
sequence of the polypeptide. The level of protein polymorphism is determined
to a large extent by the degree to which the amino acid sequence is
constrained by natural selection against variant sequences (or, in some cases,
by natural selection for variant sequences), and constraints at the protein
level are not generally predictable from $\theta$ and $\pi$.

On the other hand, there is a theoretical relation between $\theta$ and $\pi$ that is
expected under the simplifying assumption that the alleles are invisible to
natural selection. The theoretical basis of the relation between $\theta$ and $\pi$ is
discussed in connection with the neutral theory of molecular evolution in Chapter 8,
but the expected relation is that $\theta = \pi$. For the data in Problem 2.4, for
example, $\hat{\theta}$ = 0.015; this number is to be compared with $\hat{\pi}$ = 0.016, and so the
agreement with expectation is quite good. (On the other hand, the sample
size is very small.)

Estimates of nucleotide polymorphism and diversity can also be carried
out with restriction-site data in the form of restriction fragment length polymorphisms
(RFLPs). The simplest way to proceed is to analyze the restriction
sites  in  turn.  Each  monomorphic  restriction  site  is  regarded  as  identifying  six 
adjacent  monomorphic  nucleotides  (or  four  monomorphic  nucleotides,  if  the 
enzyme  has  a four-base  restriction  site).  Each  polymorphic  restriction  site  is 
regarded  as  identifying  five  monomorphic  nucleotides  and  one  polymorphic 
nucleotide  (or  three  monomorphic  and  one  polymorphic,  if  the  enzyme  has  a 
four-base  restriction  site).  In  other  words,  each  restriction  site  polymorphism 
is assumed to result from polymorphism of a single nucleotide in the restriction
site. Pairwise comparisons to estimate $\pi$ are carried out under this
assumption.  The  reasoning  is  illustrated  in  the  following  problem. 


PROBLEM 2.5 Restriction-site variation was studied around the
gene for alcohol dehydrogenase (Adh) in a population of D.
melanogaster descended from animals trapped at a Dutch fruit market
in Groningen (Cross and Birley 1986). The region contained a total of
23 sites for five restriction enzymes, each having a six-base restriction
site. A total of 16 sites were cut in all flies in the sample. The accompanying
table documents the presence (+) or absence (-) of each of the
seven polymorphic sites in a sample of 10 chromosomes. Estimate the
proportion of polymorphic nucleotides $\theta$, the nucleotide diversity $\pi$,
and the standard error of each. Does the relation $\theta = \pi$ seem to hold
for these estimates?

ANSWER Consider first the nucleotide polymorphisms. The 16
monomorphic sites identify 16 × 6 = 96 monomorphic nucleotides; the
7 polymorphic sites identify 7 × 5 = 35 monomorphic nucleotides and 7 × 1 = 7
polymorphic nucleotides (assuming only 1 nucleotide is altered for
each restriction site that is lost). Altogether, there are 138 nucleotides
of which 7 are polymorphic. Because n = 10, then $a_1$ = 2.83 and $a_2$ =
1.54. The estimate of $\theta$ is therefore $\hat{\theta}$ = (7/138)/2.83 = 0.0179 per
nucleotide site and $\mathrm{Var}(\hat{\theta}) = 0.0179/(138 \times 2.83) + (1.54 \times
0.0179^2/2.83^2) = 1.0778 \times 10^{-4}$. The standard error of $\hat{\theta}$ is therefore
0.0104 per nucleotide site. For estimating $\pi$, there are 10 × 9/2 = 45
pairwise comparisons, and a restriction site with i "plus" and (10 - i)
"minus" means that the polymorphic nucleotide site results in i × (10 - i)
pairwise mismatches. Therefore, the total number of mismatches
for each of the restriction sites, from left to right, equals 16, 24, 9, 16, 9,
9, and 21, respectively, totaling 104. In addition, there are 16 × 6
nucleotides (from the monomorphic sites) and 7 × 5 nucleotides (from
the polymorphic sites) for which the number of pairwise mismatches
equals 0. Therefore, $\hat{\pi}$ = 104/(45 × 23 × 6) = 0.017. For n = 10, $b_1$ = 0.407
and $b_2$ = 0.279, and so $\mathrm{Var}(\hat{\pi})$ = 0.0001277; hence the standard error
equals 0.011. In these data, $\hat{\theta}$ = 0.018 and $\hat{\pi}$ = 0.017, which are in very
good agreement. However, the sample size is too small to generalize
this conclusion.
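
The whole answer can be reproduced with a short script. This is a sketch only: the function name and argument names are hypothetical, and the per-site mismatch counts are taken from the answer above because the table itself is not reproduced here.

```python
import math

def restriction_site_estimates(n, site_length, n_mono_sites, mismatches):
    """Estimate theta and pi (with variances) from restriction-site data,
    assuming each polymorphic restriction site reflects exactly one
    polymorphic nucleotide. `mismatches` holds i * (n - i) for each
    polymorphic site."""
    n_poly = len(mismatches)
    k = (n_mono_sites + n_poly) * site_length   # nucleotides surveyed
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    theta = (n_poly / k) / a1
    var_theta = theta / (k * a1) + a2 * theta**2 / a1**2
    n_pairs = n * (n - 1) // 2
    pi = sum(mismatches) / (n_pairs * k)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    var_pi = (b1 / k) * pi + b2 * pi**2
    return theta, var_theta, pi, var_pi

theta_hat, var_theta, pi_hat, var_pi = restriction_site_estimates(
    n=10, site_length=6, n_mono_sites=16,
    mismatches=[16, 24, 9, 16, 9, 9, 21])
se_theta = math.sqrt(var_theta)
se_pi = math.sqrt(var_pi)
```

For a four-base enzyme, `site_length` would be 4, in line with the counting rule given before the problem.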


Uses  of  Genetic  Polymorphisms 

Whether  studied  through  allozymes  or  nucleotide  sequences,  natural  genet- 
ic variation  has  many  uses.  Genetic  variation  provides  a set  of  built-in  mark- 
ers for  the  genetic  study  of  organisms  in  their  native  habitats,  including 
organisms  for  which  domestication  or  laboratory  rearing  is  unfeasible  or  for 
which  conventional  genetic  manipulation  is  impossible. 

Genetic  polymorphisms  are  useful  in  investigating  the  genetic  relation- 
ships among  subpopulations  in  a species.  The  principle  is  that  alleles  are 
shared  among  subpopulations  because  of  migration,  and  therefore  similarity 
in  allele  frequencies  among  subpopulations  can  be  used  to  estimate  the  rate 
of  migration  (Chapter  4).  Within  subpopulations,  alleles  are  shared  because 
of  common  ancestry.  For  example,  the  Ainu  people  of  Northern  Japan  have 
numerous  Caucasoid-like  features,  including  their  facial  features,  light  skin, 
and  hairy  bodies,  yet  their  genetic  polymorphisms  clearly  show  them  to  be 
more  closely  related  to  other  Mongoloid  groups  (Watanabe  et  al.  1975). 
Among  the  most  informative  alleles,  the  Ainu  people  possess  the  D(Chi) 
allele of transferrin protein and the Di^a allele of the Diego blood group, both
of  which  are  virtually  restricted  to  Mongoloid  populations.  Conversely,  the 
Ainu  people  lack  several  alleles  that  are  polymorphic  in  Caucasoids. 

From a practical point of view, genetic polymorphisms are useful in
human  populations  as  genetic  markers  that  may  be  genetically  linked  to 
harmful  genes  that  cause  disease.  In  kinships  with  a family  history  of  the  dis- 
ease, the  genetic  markers  can  be  used  to  determine  which  members  of  the 
kindred  are  likely  to  be  carriers  of  the  harmful  gene.  The  markers  can  also  be 
used  in  early  diagnosis  of  persons  likely  to  be  affected.  RFLPs  and  other 
types  of  DNA  polymorphisms  that  are  linked  to  disease  genes  have  also 
demonstrated  their  utility  as  probes  for  identifying  recombinant  DNA  clones 
containing  the  defective  genes.  The  nearby  genetic  markers  enable  the  defec- 
tive gene  and  its  function  to  be  identified,  thus  serving  as  a first  step  in  the 
search  for  effective  treatments. 

Particularly  useful  in  population  genetics  are  DNA  markers  with  a large 
number  of  alleles  of  moderate  frequency.  In  most  organisms,  many  regions  of 
the  genome  have  multiple  alleles  consisting  of  a short  sequence  of  bases 
repeated  in  tandem.  Multiple  alleles  result  because  the  number  of  copies  of 
the  repeated  sequence  may  differ  from  one  chromosome  to  the  next.  The 
genotypes  are  even  more  variable  because  each  genotype  carries  two  alleles. 
One  of  the  practical  applications  of  the  use  of  such  polymorphisms  is  in  DNA 
typing,  in  which  the  alleles  in  the  DNA  from  a suspect  are  matched  with  those 
from  a crime-scene  sample.  The  examination  of  a sufficient  number  of  such 
highly  variable  regions  provides  a basis  for  distinguishing  one  person  from 
another  because  no  two  people  (with  the  exception  of  identical  twins)  have 
the  same  genotype.  Genetic  variability  of  this  sort  is  used  in  determining 
paternity  as  well  as  in  criminal  investigations.  The  experimental  methods  of 
DNA  typing,  and  certain  relevant  issues  in  population  genetics,  are  discussed 
in  Chapter  4. 

DNA  typing  has  also  been  applied  to  studies  of  the  natural  mating  sys- 
tems of  plants  and  animals  because,  with  the  large  number  and  high  speci- 
ficity of  DNA  types,  close  relatives  can  be  detected  in  populations.  In 
behavioral  studies,  DNA  typing  can  determine  whether  organisms  that  per- 
form mutually  altruistic  acts  are  genetically  related.  Polymorphisms  of  other 
types  can  also  be  informative  about  mating  systems.  For  example,  the 
observed  frequencies  of  genotypes  can  be  used  to  estimate  the  amount  of 
self-fertilization  in  populations  of  monoecious  plants  or  hermaphroditic 
animals. 

From  the  standpoint  of  evolutionary  biology,  sequences  of  genes  and  pat- 
terns of  polymorphism  can  be  used  to  make  inferences  about  evolutionary 
history  and  about  the  evolutionary  process.  The  sequences  of  macromole- 
cules contain  within  themselves  a record  of  their  evolutionary  history.  Organ- 
isms with  a shared  ancestry  usually  have  similar  gene  sequences.  Conversely, 
similarity  in  sequence  can  be  regarded  as  a measure  of  shared  ancestry.  As  an 
index  of  shared  ancestry,  sequence  similarity  provides  a means  of  inferring 
the ancestral relationships among a group of organisms (molecular phylogenetics,
discussed in Chapter 8). The rates and patterns of change in sequence
within  species  and  between  closely  related  species  also  contain  a record  of 
evolutionary  forces  at  work.  Within  the  past  20  years,  population  genetics  has 
gone  from  a data-poor  field  to  a data-rich  field,  and  numerous  new  methods 
of  data  analysis  and  hypothesis  testing  have  been  developed. 

MULTIPLE-FACTOR  INHERITANCE 

We  have  seen  that  Galton  and  Mendel  chose  opposite  types  of  traits  for  their 
studies  of  variation:  Galton  chose  continuous  traits,  Mendel  discrete  traits. 
The  choices  reflected  a deep  difference  of  opinion  in  the  manner  in  which 
inheritance should be studied. Galton's approach was empirical, based on
the  observed  similarity  between  relatives  such  as  parents  and  offspring. 
Mendel's  approach  was  theoretical,  based  on  unobserved  segregating  factors 
that  determined  the  patterns  of  inheritance.  Even  after  the  rediscovery  of 
Mendel's  paper  in  1900,  the  disciples  of  Galton  (called  "biometricians")  dis- 
missed its  significance,  claiming  that  the  postulated  Mendelian  factors  were 
not  only  irrelevant  for  continuous  traits  but  also  inadequate  to  explain  the 
observed  correlations  between  relatives.  The  Mendelians  argued  that  segre- 
gation and  independent  assortment  could  explain  continuous  traits  just  as 
well  as  discrete  traits.  The  acrimonious  dispute  between  the  biometricians 
and  the  Mendelians  continued  for  nearly  20  years. 

The  dispute  abated  substantially  with  a 1918  paper  by  the  statistician 
Ronald  Aylmer  Fisher  (1890-1962)  entitled  "The  correlation  between  rela- 
tives on  the  supposition  of  Mendelian  inheritance."  Fisher  examined  a math- 
ematical model  of  multifactorial  inheritance  and  deduced  the  expected 
correlations  between  relatives.  He  showed  that  the  kinds  of  data  available  for 
continuous  traits  were  not  only  compatible  with  Mendelian  inheritance  but 
were  also  predicted  by  it. 

The  spirit  of  Fisher's  model  is  shown  in  Figure  2.11,  which  illustrates  the 
genetic  variation  expected  among  the  progeny  of  a cross  between  genotypes 
that  are  heterozygous  for  each  of  three  unlinked  genes.  The  alleles  of  the 
genes  are  represented  A/a,  B/b,  and  C/c,  and  the  genetic  variation  resulting 
from  segregation  and  independent  assortment  is  evident  in  the  various 
degrees  of  shading.  If  we  assume  a trait  in  which  each  uppercase  allele  adds 
one  unit  to  the  phenotype  and  in  which  each  lowercase  allele  is  without 
effect, then the aa bb cc genotype has a phenotype of 0 and the AA BB CC
genotype  has  a phenotype  of  6.  Thus  there  are  seven  possible  phenotypes 
(0-6)  among  the  progeny.  The  distribution  of  phenotypes  is  shown  in  the 
bar  graph  in  Figure  2.12.  The  smooth  curve  is  the  normal  distribution  approx- 
imating the  data,  which  has  a mean  of  3 and  a variance  of  1.5.  In  Figure  2.11, 
we  have  assumed  that  all  of  the  variation  in  phenotype  results  from  differ- 
ences in  genotype.  If  there  were  also  random  environmental  factors  affecting 
the trait, as well as a greater number of genes, then the bars in Figure 2.12
would become less distinct and a normal distribution approximated even better.
The result is the central limit theorem at work producing Galton's
"supreme law of unreason."

Figure 2.11 Result of segregation of three independent pairs of alleles affecting
the same trait. Each allele that is indicated by an uppercase letter is assumed
to contribute one unit to the phenotype. The phenotypes range from 0 to 6 and,
in the cross between triple heterozygotes, are formed in the proportions
1:6:15:20:15:6:1.
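
The 1:6:15:20:15:6:1 proportions, and the mean of 3 and variance of 1.5 quoted for the approximating normal curve, follow from treating each of the six transmitted alleles as an independent coin toss. The sketch below is illustrative only: the function name is hypothetical, and it assumes unlinked genes, equal additive effects of uppercase alleles, and no environmental variation.

```python
from math import comb

def phenotype_counts(num_genes):
    """Relative numbers of progeny with 0, 1, ..., 2*num_genes uppercase
    alleles in a cross between multiple heterozygotes. Each parent
    transmits the uppercase allele of each gene with probability 1/2,
    so the count of uppercase alleles is binomial(2*num_genes, 1/2)."""
    m = 2 * num_genes
    return [comb(m, x) for x in range(m + 1)]

counts = phenotype_counts(3)
total = sum(counts)                         # 2**6 = 64 equally likely outcomes
probs = [c / total for c in counts]
mean = sum(x * p for x, p in enumerate(probs))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
```

Raising `num_genes` adds more, narrower bars, which is one way to see the distribution approach the normal curve as the number of genes grows.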

Figure 2.12 Distribution of phenotypes from the cross in Figure 2.11 and the
approximating normal distribution. The normal curve has mean 3 and variance 1.5.

Fisher's model was a good deal more complex than that in Figure 2.11,
allowing for differences in the effects of alleles, differences in allele frequency,
various types of dominance relations, and the effects of random environmental
factors. The work was pathbreaking in demonstrating that continuous
variation could be explained by multiple interacting Mendelian factors.
Fisher's  model  was  complex  for  its  time  and  the  paper  a difficult  one.  It  is  not 
clear  even  now  what  practical  role  Fisher's  paper  may  have  played  in  ending 
the  controversy  between  the  biometricians  and  the  Mendelians.  Not  many 
people  seem  to  have  read  it.  On  the  other  hand,  it  is  the  seminal  paper  that 
marked  the  reconciliation  of  the  theories  of  Galton  and  Mendel. 


SUMMARY 

Galton  examined  the  statistical  relations  between  the  distributions  of  phe- 
notypic traits  in  successive  generations.  Most  of  the  traits  he  studied  were 
continuous  traits,  like  height  or  weight,  which  are  measured  on  a quantita- 
tive scale.  Galton  was  very  taken  with  the  observation  that  the  phenotypes 
of  many  continuous  traits  are  distributed  according  to  the  bell-shaped  curve 
known  as  the  normal  distribution.  The  peak  of  the  normal  distribution  is 
determined  by  the  mean  and  the  spread  is  determined  by  the  variance. 
Phenotypic  variation  in  natural  populations  is  usually  in  the  form  of  differ- 
ences in  continuous  traits.  Most  continuous  traits  are  also  multifactorial,  that 
is,  determined  by  the  combined  effects  of  multiple  genetic  and  environmen- 
tal factors.  The  normal  distribution  is  often  encountered  in  practice  because 
of  the  central  limit  theorem,  which  states  that  the  limiting  distribution  of  the 
sum  of  a large  number  of  independent  random  quantities  is  normal. 

Mendel studied discrete variation, such as round versus wrinkled peas,
resulting  from  segregation  of  the  alleles  of  a single  gene.  Simple  Mendelian 
variation  is  the  rule  for  genes  and  their  products.  Genetic  variation  in  protein 
molecules  can  be  identified  by  such  techniques  as  protein  electrophoresis. 
Proteins  differing  in  electrophoretic  mobility  that  are  coded  by  alternative 
alleles  of  the  same  gene  are  called  allozymes.  Allozyme  variation  is  wide- 
spread in  most  organisms.  Based  on  electrophoretic  surveys  of  human  popu- 
lations, about  30%  of  all  enzyme-coding  genes  are  polymorphic  (in  the  sense 
that  the  most  common  allele  has  a frequency  less  than  0.95),  and  about  7%  of 
the  loci  are  heterozygous  in  an  average  person.  Plants  and  invertebrates  have 
even  higher  levels  of  allozyme  variation.  Although  there  is  wide  variation 
among species, Drosophila averages about 40% polymorphic loci with an average
heterozygosity of 14%.

Genetic variation at the DNA level can be detected with the Southern blot
procedure,  in  which  DNA  fragments  produced  by  a restriction  enzyme  are 
separated  by  electrophoresis  and  identified  by  hybridization  with  a homolo- 
gous labeled  probe  sequence.  Polymorphisms  in  the  length  of  restriction 
fragments  (restriction  fragment  length  polymorphisms)  are  abundant 
throughout  the  genome  and  have  applications  in  studies  of  genetic  linkage  in 
many  organisms.  DNA  studies  are  also  often  carried  out  with  the  polymerase 
chain  reaction  (PCR),  in  which  multiple  cycles  of  primer  annealing,  DNA 
replication,  and  strand  separation  are  used  to  exponentially  amplify  the  DNA 
sequence  flanked  by  the  oligonucleotide  primers.  Amplified  DNA  may  be 
sequenced,  used  as  a probe,  or  manipulated  in  other  ways. 

Polymorphisms  in  nucleotide  sequence  are  abundant  in  natural  popula- 
tions, particularly  in  noncoding  regions  and  at  silent  sites  in  coding  regions 
(especially  at  third  codon  positions,  in  which  a nucleotide  substitution  need 
not  result  in  an  amino  acid  replacement).  For  a sample  of  DNA  sequences, 
the  amount  of  nucleotide  polymorphism  is  the  proportion  of  nucleotide  sites 
occupied  by  two  or  more  bases  (A,  T,  G,  C)  in  the  sample.  The  nucleotide 
diversity  is  the  average  proportion  of  nucleotide  differences  between  all 
sequences  in  the  sample  taken  in  pairwise  comparison.  The  estimates  of 
nucleotide  polymorphism  and  nucleotide  diversity  are  not  readily  compared 
with  allozyme  data  because  much  of  the  observed  sequence  variation  is 
either  noncoding  or  silent. 
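
The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the function names and the four 10-bp sequences are hypothetical, and the sketch assumes that the sequences are already aligned and of equal length.

```python
from itertools import combinations


def nucleotide_polymorphism(seqs):
    """Proportion of sites occupied by two or more different bases."""
    length = len(seqs[0])
    segregating = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    return segregating / length


def nucleotide_diversity(seqs):
    """Average proportion of differing sites over all pairs of sequences."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total_diffs / (len(pairs) * length)


# Hypothetical aligned sample of four 10-bp sequences (not real data).
sample = ["ATGCATGCAT", "ATGCATGCAA", "ATGTATGCAT", "ATGTATGCAA"]

polymorphism = nucleotide_polymorphism(sample)  # 2 of the 10 sites vary
diversity = nucleotide_diversity(sample)        # 8 differences over 6 pairs
print(polymorphism, diversity)
```

In this made-up sample, 2 of the 10 sites vary, and the six pairwise comparisons show 8 differences in total, so the diversity is 8/(6 x 10).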

There  is  often  a disconnect  between  molecular  variation  and  phenotypic 
variation  because  differences  in  phenotype  among  healthy  organisms  cannot 
usually  be  attributed  to  differences  in  specific  molecules.  Indeed,  there  is  a 
sort  of  disconnect  between  simple  Mendelian  inheritance  and  continuous 
variation  because  the  segregation  of  any  pair  of  alleles  affecting  a continuous 
trait  is  obscured  by  the  segregation  of  other  pairs  of  alleles  as  well  as  by  the 
effects  of  the  environment.  In  the  early  years  after  the  rediscovery  of 
Mendel's  paper,  there  was  considerable  controversy  whether  Mendelian 
factors could account for the patterns of variation and correlation among
relatives noted by Galton and others. The issue was resolved theoretically by
R. A. Fisher's 1918 paper on the correlation between relatives on the
supposition of Mendelian inheritance. Closing the gap between the study of
evolution at the level of phenotypes and at the level of molecular genotypes
remains  one  of  the  major  challenges  in  population  genetics. 


PROBLEMS 

1.  Shell  widths  of  mussels  are  approximately  normally  distributed.  If  the 
mean  is  70  mm  and  the  standard  deviation  is  10  mm,  what  fraction  of 
the  population  is  smaller  than  80  mm? 

2.  Following  Problem  1,  what  fraction  of  the  population  is  between  80  and 
90  mm  in  width? 

3. Calculate the mean, variance, standard deviation, and standard error of
the mean for the following bristle counts: 13, 14, 13, 15, 14, 15, 12, 15, 14,
16, 12, 15, 13, 14.

4. Measurements of body weight of a very large sample from a species of
mouse have a mean of 60 g and variance of 64 g². In one area it was
suspected that environmental contamination had reduced the size of the
mice. A sample of 100 mice from this area had a mean of 58 g and a
sample variance of 64 g². Is this sample population significantly smaller in
size than the population examined with the very large sample?

5.  A standard  means  for  using  a computer  to  generate  normally  distributed 
random  numbers  is  to  take  12  uniform  random  numbers  and  add  them 
up.  After  scaling  the  sum  by  a constant  that  depends  on  the  mean  and  the 
variance,  the  result  represents  a sample  from  the  normal  distribution  one 
wants.  Why  does  this  approach  work? 

6.  One  statement  of  the  central  limit  theorem  is  that  the  sum  of  indepen- 
dent, identically  distributed  random  variables  has  a limiting  normal  dis- 
tribution. If  the  variables  that  are  being  added  exhibit  positive  covariance 
in successive measures (as opposed to being independent), how would
the sum deviate from the normal distribution predicted by the central
limit theorem?

7. Allozyme gels reveal a sample with 64 FF, 32 FS, and 4 SS females, but
there seem to be 40 FF males and 10 SS males with no heterozygotes.
How do you explain these data?

8.  Many  proteins  exist  in  an  active  form  only  as  dimers,  with  two  molecules 
joined either by hydrogen bonding or even by covalent cysteine bridges.
If an enzyme is only active as a dimer, and there is electrophoretic
variation in a population with two alleles (F and S), what do you think a
heterozygote would look like on a gel? What would a heterozygote look like
if  only  tetramers  were  active? 

9. How many copies of a fragment of DNA should be present after 30
rounds  of  PCR,  assuming  perfect  efficiency? 

10.  Taq  polymerase  does  not  have  perfect  fidelity  in  copying  DNA  sequences, 
and  the  result  is  that  PCR  products  have  some  variation  in  sequence. 
Why  do  you  suppose  it  is  still  possible  to  sequence  PCR-amplified  DNA 
to  obtain  the  true  sequence?  When  might  the  errors  caused  by  Taq  poly- 
merase cause  errors  in  the  final  sequence? 

11.  Many  new  ways  of  scoring  DNA  variation  at  individual  nucleotide  sites 
are  becoming  available,  including  an  oligonucleotide  ligation  assay, 
"Taqman,"  template-directed  dye-terminator  incorporation  (TDI),  and 
hybridization  to  dense  oligonucleotide  arrays  known  as  DNA  chips.  An 
important  criterion  for  the  utility  of  any  of  these  methods  is  that  it  must 
be  very  accurate.  Why  is  accuracy  so  critical? 

12.  Four  sequences  of  a 1200  bp  gene  gave  the  following  counts  of  pairwise 
differences:  4,  7,  5,  3,  6,  5.  What  is  the  estimate  of  nucleotide  diversity  for 
this  sample? 

13.  In  forensic  applications  of  genetics,  if  the  DNA  types  from  a crime  scene 
and  a suspect  do  not  match,  the  confidence  one  has  in  the  conclusion  is 
much  greater  than  if  the  types  do  match.  Why? 


The word population has so far been used in an informal, intuitive
sense  to  refer  to  a group  of  organisms  belonging  to  the  same 
species.  Further  discussion  and  clarification  of  the  concept  is  nec- 
essary at  this  time.  In  population  genetics,  the  word  population  does  not 
usually  refer  to  an  entire  species;  it  refers  instead  to  a group  of  organisms  of 
the  same  species  living  within  a sufficiently  restricted  geographical  area  that 
any  member  can  potentially  mate  with  any  other  member  (provided  that 
they  are  of  the  opposite  sex).  Precise  definition  of  such  a unit  is  difficult  and 
varies  from  species  to  species  because  of  the  almost  universal  presence  of 
some  sort  of  geographical  structure  in  species — some  typically  nonrandom  pat- 
tern in  the  spatial  distribution  of  organisms.  Members  of  a species  are  rarely 
distributed  homogeneously  in  space:  there  is  almost  always  some  sort  of 
clumping  or  aggregation,  some  schooling,  flocking,  herding,  or  colony  for- 
mation. Population  subdivision  is  often  caused  by  environmental  patchiness, 
areas  of  favorable  habitat  intermixed  with  unfavorable  areas.  Such  environ- 
mental patchiness  is  obvious  in  the  case  of,  for  example,  terrestrial  organisms 
on  islands  in  an  archipelago,  but  patchiness  is  a common  feature  of  most 
habitats — freshwater  lakes  have  shallow  and  deep  areas,  meadows  have 
marshy  and  dry  areas,  forests  have  sunny  and  shady  areas.  Population  sub- 
division can  also  be  caused  by  social  behavior,  as  when  wolves  form  packs. 
Even  the  human  population  is  clumped  or  aggregated— into  towns  and 
cities,  away  from  deserts  and  mountains. 

The local interbreeding units of possibly large, geographically structured
populations  are  of  some  interest  because  it  is  within  such  local  units  that 
adaptive  evolution  takes  place  through  systematic  changes  in  allele  frequen- 
cy. Such  local  interbreeding  units— often  called  local  populations  or 
demes— are  the  fundamental  units  of  population  genetics.  Local  populations 
are  the  actual,  evolving  units  of  a species.  Unless  otherwise  specified  (or  clear 
from  context),  the  term  population  as  used  in  this  book  means  local  population. 
Local  populations  are  sometimes  also  referred  to  as  Mendelian  populations  or 
subpopulations. 


RANDOM  MATING 

In  sexual  organisms,  genotypes  are  not  transmitted  from  one  generation  to 
the  next.  Genotypes  are  broken  up  in  gamete  formation  by  the  processes  of 
segregation and recombination, and they are assembled anew in each
generation in fertilization: genotypes → gametes → genotypes. The frequency of a
specified  genotype  in  a population  is  the  genotype  frequency.  The  formation 
of  a genotype  in  newly  fertilized  eggs  is  determined  by  the  opportunity  for 
the  relevant  gametes  to  come  together  in  fertilization,  and  the  opportunity  for 
gametes  to  come  together  in  fertilization  is  determined  by  the  matings  that 
take  place  among  organisms  of  reproductive  age  in  the  previous  generation. 
To  put  the  matter  in  a slightly  different  way,  the  genotypes  of  the  mating 
pairs  determine  the  genotypes  of  the  progeny.  Furthermore,  there  are  mathe- 
matical relationships  between  the  frequencies  of  mating  pairs  and  the  fre- 
quencies of  progeny  genotypes.  Such  mathematical  relationships  are  usually 
inferred  from  models  in  which  the  types  of  matings  in  the  population  are 
specified.  One  of  the  important  models  in  population  genetics  is  that  of  ran- 
dom mating,  in  which  mating  pairs  have  the  same  frequencies  as  if  they  were 
formed  by  random  collisions  between  genotypes.  The  chance  that  an  organ- 
ism mates  with  another  having  a prescribed  genotype  is  therefore  equal  to 
the  frequency  of  the  prescribed  genotype  in  the  population.  For  example, 
suppose  that  in  some  population  the  genotype  frequencies  of  AA,  Aa,  and  aa 
are  0.16, 0.48,  and  0.36,  respectively;  if  mating  is  random,  AA  males  mate  with 
AA,  Aa,  and  aa  females  in  the  proportions  0.16,  0.48,  and  0.36,  respectively; 
these  same  proportions  apply  to  the  mates  of  Aa  and  aa  males. 
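
The example above can be written out as a short Python sketch. It assumes, as the text does, that the genotype frequencies are the same in males and females; the function name `mating_pair_frequencies` is only illustrative.

```python
from itertools import product

# Genotype frequencies from the example in the text.
genotype_freq = {"AA": 0.16, "Aa": 0.48, "aa": 0.36}


def mating_pair_frequencies(freq):
    """Frequency of each ordered (male, female) pair under random mating."""
    return {(m, f): freq[m] * freq[f] for m, f in product(freq, repeat=2)}


pairs = mating_pair_frequencies(genotype_freq)
for (male, female), value in pairs.items():
    print(f"{male} male x {female} female: {value:.4f}")
```

Each entry is simply the product of the two genotype frequencies, and the nine ordered pairs together account for all matings.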

Superficial  appearances  to  the  contrary,  random  mating  is  not  a simple  or 
trivial  process.  One  complication  is  that  random  mating  depends  on  the  trait: 
mating  can  be  random  with  respect  to  some  traits  but  nonrandom  with 
respect  to  other  traits  at  the  same  time  and  in  the  same  population.  For  exam- 
ple, it  is  perfectly  consistent  for  a human  population  to  undergo  random  mat- 
ing with  respect  to  blood  groups,  allozyme  phenotypes,  restriction  fragment 
length  polymorphisms,  and  many  other  characteristics,  but  at  the  same  time 
to  engage  in  nonrandom  mating  with  respect  to  other  traits  such  as  skin  color 
and height. A second complication is population substructure. Paradoxical as
it  may  seem,  random  mating  may  be  observed  within  each  of  the  subpopu- 
lations constituting  a larger  population,  but  random  mating  may  still  fail  to 
hold  in  the  population  as  a whole.  (The  reason  for  this  paradox  is  discussed 
in  Chapter  4.)  In  spite  of  these  and  other  complications,  random  mating  plays 
an  important  role  in  models  in  population  genetics  because  random  mating 
often  serves  as  a point  of  departure  for  considering  more  realistic  situations. 

Nonoverlapping  Generations 

One  of  the  most  important  mathematical  models  in  population  genetics  is  the 
nonoverlapping  generation  model,  in  which  the  cycle  of  birth,  maturation, 
and  death  includes  the  death  of  all  organisms  present  in  each  generation 
before  the  members  of  the  next  generation  mature.  The  nonoverlapping  gen- 
eration model  is  diagrammed  in  Figure  3.1.  The  model  applies  literally  only 
to  organisms  with  a very  simple  sort  of  life  history,  such  as  certain  short-lived 
insects  or  annual  plants  that  have  a short  growing  season.  In  such  plants,  all 
members  of  any  generation  germinate  at  about  the  same  time,  mature  togeth- 
er, shed  their  pollen,  are  fertilized  almost  simultaneously,  and  die  immedi- 
ately after  producing  the  new  generation.  This  sort  of  hypothetical 
population,  with  its  simple  life  history,  is  used  in  population  genetics  as  a 
first  approximation  to  populations  that  have  more  complex  life  histories. 
Although at first glance the model seems hopelessly oversimplified,
calculations of expected genotype frequencies based on the model are adequate
for many purposes. In some applications, the nonoverlapping generation model
turns out to be a useful approximation even for populations with a long and
complex life history, such as human beings.

Figure 3.1 The nonoverlapping generation model, shown for generations
t - 1, t, and t + 1. The life history of the organism is assumed to be like
that of an annual plant (or any short-lived organism), and the generations
are assumed to be separated in time (discrete generations). Although the
model is simple, it provides a convenient first approximation to populations
with more complex life histories.

The  Hardy-Weinberg  Principle 

Genotype  frequencies  are  determined  in  part  by  the  pattern  of  mating.  In  this 
section,  we  consider  the  consequences  of  random  mating  in  the  model  with 
nonoverlapping  generations.  To  deduce  the  genotype  frequencies  under  ran- 
dom mating,  additional  assumptions  are  needed.  First,  the  allele  frequencies 
should  not  change  from  one  generation  to  the  next  because  of  systematic  evo- 
lutionary forces,  the  most  important  of  which  are  mutation,  migration  and 
natural  selection.  For  the  moment,  these  evolutionary  forces  are  assumed  to 
be  absent  or  negligibly  small  in  magnitude.  (Their  effects  are  discussed  in 
Chapters  5 and  6.)  Second,  the  population  must  be  large  enough  in  size  that 
the  allele  frequencies  are  not  subject  to  change  merely  because  of  sampling 
error.  Variation  in  allele  frequency  owing  to  sampling  error  in  small  popula- 
tions is  called  random  genetic  drift  and  is  the  subject  of  Chapter  7.  Although 
random  genetic  drift  is  present  unless  the  population  is  infinite  in  size,  the 
magnitude  of  the  effect  on  allele  frequency  over  a small  number  of  genera- 
tions is  usually  sufficiently  small  that  the  process  can  be  ignored  if  popula- 
tion size  is  500  or  more.  The  qualifier  “over  a small  number  of  generations"  is 
important  because  the  effects  of  random  genetic  drift  are  cumulative.  Con- 
sidered over  a sufficiently  large  number  of  generations,  random  genetic  drift 
can be important even in populations of size $10^6$ or more.

Before proceeding further, it may be helpful to summarize the
assumptions that we are making:

• The  organism  is  diploid. 

• Reproduction  is  sexual. 

• Generations  are  nonoverlapping. 

• The  gene  under  consideration  has  two  alleles. 

• The  allele  frequencies  are  identical  in  males  and  females. 

• Mating  is  random. 

• Population  size  is  very  large  (in  theory,  infinite). 

• Migration  is  negligible. 

• Mutation  can  be  ignored. 

• Natural  selection  does  not  affect  the  alleles  under  consideration. 

Collectively, these assumptions summarize the Hardy-Weinberg model,
named after the English mathematician G. H. Hardy (1877-1947) and the
German physiologist Wilhelm Weinberg (1862-1937), who, in 1908,
independently formulated the model and deduced its theoretical predictions of
genotype frequency.

In the Hardy-Weinberg model, the mathematical relation between the
allele frequencies and the genotype frequencies is given by

AA: $p^2$    Aa: $2pq$    aa: $q^2$    (3.1)

in which $p^2$, $2pq$, and $q^2$ are the frequencies of the genotypes AA, Aa, and aa in
zygotes of any generation, $p$ and $q$ are the allele frequencies of A and a in
gametes of the previous generation, and $p + q = 1$. The frequencies displayed
in Equation 3.1 constitute the Hardy-Weinberg principle or the
Hardy-Weinberg equilibrium (HWE).

One rationale for the Hardy-Weinberg principle displayed in Equation 3.1
is based on the outcome of repeated and independent trials. With random
mating, the choices of male gamete and female gamete are independent
trials, and so pairs of gametes carrying the alleles AA, Aa, or aa are expected in
proportions given by $(p\,A + q\,a)^2 = p^2\,AA + 2pq\,Aa + q^2\,aa$. A graphical
illustration of the rationale of independent trials is shown in Figure 3.2. The
chance of two A-bearing gametes coming together is $p \times p = p^2$ and that of
two a-bearing gametes coming together is $q \times q = q^2$; for the heterozygote, the
chance is $p \times q + q \times p = 2pq$ because the female gamete could carry A and the
male gamete carry a, or the other way around.
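
As a minimal sketch of the independent-trials argument, the function below (the name `hardy_weinberg` is just a label chosen here) multiplies the gamete frequencies to obtain the expected zygote frequencies. With an allele frequency of 0.4 for A it gives the genotype frequencies 0.16, 0.48, and 0.36 that were used in the random-mating example earlier in the chapter.

```python
def hardy_weinberg(p):
    """Return expected zygote frequencies (AA, Aa, aa) for allele frequency p of A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    # AA needs A from both parents; Aa can arise in two ways (A from either parent).
    return p * p, 2.0 * p * q, q * q


freq_AA, freq_Aa, freq_aa = hardy_weinberg(0.4)
print(round(freq_AA, 4), round(freq_Aa, 4), round(freq_aa, 4))
```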


                                  Male gametes
                           A (frequency p)    a (frequency q)
Female    A (frequency p)  AA: $p^2$          Aa: $pq$
gametes   a (frequency q)  aA: $qp$           aa: $q^2$

Summed frequencies in zygotes:

AA: $P' = p^2$
Aa: $Q' = pq + qp = 2pq$
aa: $R' = q^2$

Figure 3.2 Cross-multiplication square showing Hardy-Weinberg frequencies
resulting from random mating with two alleles.

TABLE 3.1  DEMONSTRATION OF THE HARDY-WEINBERG PRINCIPLE

             Frequency of         Frequency of zygotes (progeny)
Mating       mating (parents)     AA        Aa        aa
AA x AA      $P^2$                1         0         0
AA x Aa      $2PQ$                1/2       1/2       0
AA x aa      $2PR$                0         1         0
Aa x Aa      $Q^2$                1/4       1/2       1/4
Aa x aa      $2QR$                0         1/2       1/2
aa x aa      $R^2$                0         0         1
Totals (next generation)          $P'$      $Q'$      $R'$

Therefore:

$P' = P^2 + 2PQ/2 + Q^2/4 = (P + Q/2)^2 = p^2$
$Q' = 2PQ/2 + 2PR + Q^2/2 + 2QR/2 = 2(P + Q/2)(R + Q/2) = 2pq$
$R' = Q^2/4 + 2QR/2 + R^2 = (R + Q/2)^2 = q^2$

Random  Mating  of  Genotypes  versus  Random  Union  of  Gametes 

Figure  3.2  implicitly  assumes  an  important  premise:  that  random  mating  of 
genotypes  is  equivalent  to  random  union  of  gametes.  A demonstration  of  this 
premise  in  the  case  of  two  alleles  is  outlined  in  Table  3.1,  in  which  pairs  of 
genotypes  are  chosen  at  random  to  form  matings.  The  genotype  frequencies 
of AA, Aa, and aa in the parental generation are written as $P$, $Q$, and $R$,
respectively, where $P + Q + R = 1$. In terms of the genotype frequencies, the
allele frequencies $p$ of A and $q$ of a are as follows:

$p = (2 \times P + Q)/2 = P + Q/2$
$q = (2 \times R + Q)/2 = R + Q/2$        (3.2)

Note that $p + q = P + Q + R = 1.0$; this result is a consequence of the fact
that the gene has only two alleles.
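
The same calculation can be carried out on genotype counts rather than frequencies. The sketch below is illustrative (the function name is an assumption made here); it uses the MN blood group counts quoted later in this chapter, 298 MM, 489 MN, and 213 NN, and gives the estimates 1085/2000 = 0.5425 and 915/2000 = 0.4575 stated there.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Estimate allele frequencies (p of A, q of a) from genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    # Each homozygote carries two copies of its allele, each heterozygote one.
    p = (2 * n_AA + n_Aa) / total_alleles
    q = (2 * n_aa + n_Aa) / total_alleles
    return p, q


# MN blood group counts quoted in this chapter: 298 MM, 489 MN, 213 NN.
p_M, q_N = allele_frequencies(298, 489, 213)
print(p_M, q_N)
```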

With  two  alleles  of  a gene,  there  are  six  possible  types  of  matings.  When 
mating  is  random,  these  mating  types  take  place  in  proportion  to  the  geno- 
typic  frequencies  in  the  population,  and  the  types  of  mating  pairs  are  given 
by successive terms in the expansion of $(P\,AA + Q\,Aa + R\,aa)^2$. For example,
the proportion of AA x AA matings is $P \times P = P^2$. Similarly, the proportion of
AA x Aa matings is $2 \times P \times Q$ because the mating can include either an AA
male with an Aa female (proportion $P \times Q$) or an Aa male with an AA female
(proportion $Q \times P$). The frequencies of these and the other types of matings
are  given  in  the  second  column  of  Table  3.1. 

The genotypes of the zygotes produced by the matings are given in the
last three columns of Table 3.1. The offspring frequencies follow from
Mendel's law of segregation, which states that an Aa heterozygote produces
an equal number of A-bearing and a-bearing gametes. The AA and aa
homozygotes produce only A-bearing and only a-bearing gametes,
respectively. Thus, the mating AA x aa produces all Aa zygotes, the mating AA x Aa
produces 1/2 AA and 1/2 Aa zygotes, the mating Aa x Aa produces 1/4 AA,
1/2 Aa, and 1/4 aa zygotes, and so forth.

The  genotype  frequencies  of  AA,  Aa,  and  aa  zygotes  after  one  generation 
of  random  mating  are  denoted  in  Table  3.1  as  P',  Q',  and  R',  respectively. 
These  values  are  calculated  as  the  sum  of  the  cross-products  shown  at  the 
bottom of the table. The genotype frequencies simplify to $P' = p^2$, $Q' = 2pq$,
and $R' = q^2$, where $p$ and $q$ are the allele frequencies given in Equation 3.2.
Note  that  the  parental  genotype  frequencies — P,  Q,  and  R — were  completely 
arbitrary  except  for  the  requirement  that  P + Q + R = 1.  Therefore,  the  Hardy- 
Weinberg  frequencies  are  attained  after  one  generation  of  random  mating 
irrespective  of  the  genotype  frequencies  in  the  parental  generation. 
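
The bookkeeping of Table 3.1 can be checked numerically. The following sketch assumes the six mating types and Mendelian offspring fractions of the table; the starting frequencies 0.5, 0.2, and 0.3 are arbitrary, hypothetical values chosen to be far from Hardy-Weinberg proportions.

```python
# Fraction of AA, Aa, aa offspring from each mating type (Mendelian segregation).
OFFSPRING = {
    ("AA", "AA"): (1.0, 0.0, 0.0),
    ("AA", "Aa"): (0.5, 0.5, 0.0),
    ("AA", "aa"): (0.0, 1.0, 0.0),
    ("Aa", "Aa"): (0.25, 0.5, 0.25),
    ("Aa", "aa"): (0.0, 0.5, 0.5),
    ("aa", "aa"): (0.0, 0.0, 1.0),
}


def next_generation(P, Q, R):
    """Zygote frequencies (P', Q', R') after one generation of random mating."""
    freq = {"AA": P, "Aa": Q, "aa": R}
    totals = [0.0, 0.0, 0.0]
    for (g1, g2), fractions in OFFSPRING.items():
        # Unlike genotypes can pair two ways (sexes swapped), hence the factor 2.
        mating = freq[g1] * freq[g2] * (1 if g1 == g2 else 2)
        for i, fraction in enumerate(fractions):
            totals[i] += mating * fraction
    return tuple(totals)


# Arbitrary (hypothetical) starting frequencies that are far from Hardy-Weinberg.
P, Q, R = 0.5, 0.2, 0.3
p, q = P + Q / 2, R + Q / 2          # allele frequencies: 0.6 and 0.4
P1, Q1, R1 = next_generation(P, Q, R)
print(P1, Q1, R1)
```

Here the parents have only 20% heterozygotes, yet the zygotes come out at $p^2 = 0.36$, $2pq = 0.48$, and $q^2 = 0.16$, as the algebra at the bottom of the table requires.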


PROBLEM 3.1 A four-base cleavage site for the restriction enzyme
BanI is located within a large intron of the larval transcript of the gene
coding for alcohol dehydrogenase in D. melanogaster. Cleavage at this
site was found in 29 of 60 chromosomes isolated from a population
sampled at a farmer's market in Raleigh, North Carolina (Kreitman
and Aguade 1986). Letting B and b represent the presence or absence,
respectively, of the BanI site in a chromosome, and assuming
Hardy-Weinberg genotype frequencies, calculate the expected frequencies of
the genotypes BB, Bb, and bb.


ANSWER The estimated allele frequencies are $p = 29/60 = 0.48$ of B
and $q = 1 - p = 0.52$ of b, and so the expected genotype frequencies
with HWE are $p^2 = 0.23$ BB, $2pq = 0.50$ Bb, and $q^2 = 0.27$ bb.


PROBLEM  3.2  In  an  experimental  population  of  D.  melanogaster,  the 
genotype  frequencies  for  two  alleles,  E6F  and  E6S,  of  the  gene  coding 
for  esterase-6  were  found  to  be  consistent  with  Hardy-Weinberg  pro- 
portions with  allele  frequencies  of  0.3579  for  E6F  and  0.6421  for  E6S 
(Mukai et al. 1974). Assuming that all of the assumptions of the
Hardy-Weinberg  model  hold,  particularly  those  pertaining  to  random 
mating  in  a large  population  with  no  mutation,  selection,  or  migra- 
tion, make  a table  of  mating  frequencies  similar  to  Table  3.1  for  the 
esterase-6  alleles.  Then  calculate  the  genotype  frequencies  expected  in 
the  next  generation  along  with  the  corresponding  allele  frequencies. 


ANSWER  The  Hardy-Weinberg  frequencies  among  parents  are  FF: 
0.1281;  FS:  0.4596,  and  SS:  0.4123.  Therefore,  the  expected  frequencies 
of  the  matings  are:  FF  x FF  (0.0164);  FF  x FS  (0.1177);  FF  x SS  (0.1056); 
FS  x FS  (0.2112);  FS  x SS  (0.3790);  and  SS  x SS  (0.1700).  The  expected 
genotype frequencies among the zygotes are, for FF, 0.0164 + 0.1177/2
+ 0.2112/4 = 0.1281; for FS, 0.1177/2 + 0.1056 + 0.2112/2 + 0.3790/2 =
0.4596; for SS, 0.2112/4 + 0.3790/2 + 0.1700 = 0.4123; note that these
are  the  same  as  in  the  parental  generation.  The  allele  frequencies  of  F 
and  S are  again  0.3579  and  0.6421,  respectively. 


PROBLEM  3.3  Use  a cross-multiplication  square  like  that  in  Figure 
3.2  to  show  that,  when  the  allele  frequencies  differ  in  male  and  female 
parents,  the  Hardy-Weinberg  frequencies  are  not  attained  after  one 
generation of random mating. Use the symbols $p_m$ and $q_m$ for the
frequencies of A and a in male gametes and the symbols $p_f$ and $q_f$ for the
frequencies of A and a in female gametes. After the first generation of
random  mating,  what  are  the  genotype  frequencies  in  male  and 
female  zygotes?  What  are  the  allele  frequencies  in  male  and  female 
zygotes?  What  are  the  genotype  frequencies  in  zygotes  after  the  sec- 
ond generation  of  random  mating?  Are  these  in  Hardy-Weinberg  pro- 
portions? 


ANSWER  This  problem  demonstrates  the  principle  that,  with  ran- 
dom mating,  the  frequency  of  an  allele  in  zygotes  equals  the  average 
of  the  allele  frequencies  in  the  parents.  If  the  allele  frequencies  in  par- 
ents differ,  then  random  mating  results  in  Hardy-Weinberg  propor- 
tions only  after  two  generations.  The  first  generation  equalizes  the 
allele frequencies in males and females, and the second generation
yields the Hardy-Weinberg proportions. Using the suggested
symbols, after one generation of random mating, the genotype
frequencies are AA: $p_m p_f$, Aa: $p_m q_f + q_m p_f$, and aa: $q_m q_f$. These are not
in the form $x^2$, $2x(1 - x)$, and $(1 - x)^2$ unless $p_m = p_f$ and $q_m = q_f$.
However, the allele frequencies have become equal in the sexes at
$p = p_m p_f + (p_m q_f + q_m p_f)/2 = (p_m + p_f)/2$ and $q = (q_m + q_f)/2$.
The HWE is reached in one additional generation of random mating, in
which the genotype frequencies in zygotes are $p^2$, $2pq$, and $q^2$.
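
A numerical version of this answer may make the two-step approach to equilibrium easier to see. The sketch assumes an autosomal gene, so that male and female zygotes receive the same genotype frequencies; the starting values 0.8 and 0.2 and the function names are hypothetical.

```python
def zygotes_from_gametes(p_m, p_f):
    """Zygote frequencies (AA, Aa, aa) from male and female gamete frequencies of A."""
    q_m, q_f = 1.0 - p_m, 1.0 - p_f
    return p_m * p_f, p_m * q_f + q_m * p_f, q_m * q_f


def allele_frequency(genotypes):
    """Frequency of A given zygote frequencies (AA, Aa, aa)."""
    freq_AA, freq_Aa, _ = genotypes
    return freq_AA + freq_Aa / 2.0


# Hypothetical starting values: A is common in males and rare in females.
gen1 = zygotes_from_gametes(p_m=0.8, p_f=0.2)
p = allele_frequency(gen1)            # the average of 0.8 and 0.2, in both sexes
gen2 = zygotes_from_gametes(p, p)     # second generation of random mating
print(gen1, p, gen2)
```

With these values the first generation has 68% heterozygotes, well above the 50% expected for $p = 0.5$, and the Hardy-Weinberg proportions appear only in the second generation.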


Implications  of  the  Hardy-Weinberg  Principle 

The  Hardy-Weinberg  principle  has  provided  the  foundation  for  many  theo- 
retical and  experimental  investigations  in  population  genetics.  However,  the 
theory  is  far  from  profound,  and  the  applicability  is  far  from  universal. 
Hardy  especially  seems  to  have  regarded  the  Hardy-Weinberg  principle  as 
virtually self-evident. He writes, "I should have expected the very simple
point which I wish to make to have been familiar to biologists." In fact, it was
familiar  to  some  biologists — the  basic  principle  had  been  noted  as  early  as 
1903  by  the  Harvard  geneticist  William  E.  Castle  (1867-1962).  Castle's  work 
was  little  known,  however,  and  Hardy  was  writing  to  counter  an  argument 
put  forth  against  Mendelism  that  phenotypic  ratios  of  3 dominant  to  1 reces- 
sive should  be  encountered  frequently  in  natural  populations  if  the  mecha- 
nism of  Mendelian  heredity  were  generally  applicable.  The  immediate 
implication of the Hardy-Weinberg principle was to refute the 3 : 1 argument
by showing that the genotypic ratio of A- : aa is determined by the allele
frequencies and has no special tendency to attain any one particular ratio.

Beyond  the  virtue  of  simplicity,  why  would  anyone  want  to  consider  a 
model  based  on  so  many  restrictive  and  seemingly  incorrect  assumptions? 
And  in  what  sense  can  such  a simple  model  be  considered  fundamental? 
Among  several  reasons,  two  stand  out.  First,  the  Hardy-Weinberg  model  is  a 
reference  model  in  which  there  are  no  evolutionary  forces  at  work  other  than 
those  imposed  by  the  process  of  reproduction  itself.  In  this  sense,  the  model 
is  similar  to  models  in  mechanical  physics  where  objects  fall  through  the  sky 
without  wind  resistance  or  roll  down  inclined  planes  without  friction.  The 
model  affords  a baseline  for  comparison  with  more  realistic  models  in  which 
evolutionary  forces  can  change  allele  frequencies.  Perhaps  more  importantly, 
the Hardy-Weinberg model separates life history into two intervals: gametes
→ zygotes and zygotes → adults. In constructing more complex and realistic
models, one can often introduce the complications into the zygotes → adults
part of the life cycle, for example, in considering the effects of migration into
the population or of differential survival among the genotypes. With all
sources of change in allele frequency accounted for in the zygotes → adults
component, the gametes → zygotes component follows from the principle
that random union of gametes results in the Hardy-Weinberg
proportions among zygotes. In other words, the Hardy-Weinberg model is
fundamental in the sense that the approach of tracking allele and genotype
frequencies through time can be generalized to more realistic situations.

One  of  the  most  important  implications  of  the  Hardy-Weinberg  principle 
emerges when we calculate the allele frequencies of A and a in the next
generation from the formulas for $P'$, $Q'$, and $R'$ in Table 3.1. Using the
result in Equation 3.2, the allele frequency of A among the zygotes equals
$P' + Q'/2 = p^2 + 2pq/2 = p(p + q) = p$. Likewise, the allele frequency of a
among zygotes equals $R' + Q'/2 = q^2 + 2pq/2 = q(q + p) = q$. Thus, the allele
frequencies in the next generation are exactly the same as they were the
generation before. With random mating, the allele frequencies remain the
same generation after generation. In any generation, therefore, the genotype
frequencies are $p^2$, $2pq$, and $q^2$ for AA, Aa, and aa, respectively, as given
in Equation 3.1. The constancy of allele frequency, and therefore of the
genotypic composition of the population, is the single most important
implication of the Hardy-Weinberg
principle. The constancy of allele frequencies implies that, in the absence of
specific  evolutionary  forces  to  change  allele  frequency,  the  mechanism  of 
Mendelian  inheritance,  by  itself,  keeps  the  allele  frequencies  constant  and 
thus  preserves  genetic  variation.  A second  item  of  interest  is  that  the  Hardy- 
Weinberg  frequencies  are  attained  in  just  one  generation  of  random  mating  if 
the  allele  frequencies  are  the  same  in  males  and  females.  This,  however,  is 
true  only  with  nonoverlapping  generations;  in  populations  with  more  com- 
plex life  histories,  the  Hardy-Weinberg  frequencies  are  attained  gradually 
over  a period  of  several  generations. 
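The constancy of allele frequency can be checked numerically. The following is a minimal sketch, assuming nonoverlapping generations and equal allele frequencies in the two sexes; the starting genotype frequencies are hypothetical and were chosen to be far from Hardy-Weinberg proportions, and the function names are ours.

```python
def allele_freqs(P, Q, R):
    """Allele frequencies of A and a from genotype frequencies of AA, Aa, aa."""
    total = P + Q + R
    p = (P + Q / 2) / total
    return p, 1 - p


def random_mating(P, Q, R):
    """Genotype frequencies among zygotes after one round of random mating."""
    p, q = allele_freqs(P, Q, R)
    return p * p, 2 * p * q, q * q


# Hypothetical starting population, not in Hardy-Weinberg proportions.
genotypes = (0.50, 0.10, 0.40)
for generation in range(4):
    p, q = allele_freqs(*genotypes)
    print(generation, [round(x, 4) for x in genotypes], "p =", round(p, 4))
    genotypes = random_mating(*genotypes)
```

In this sketch the allele frequency p is unchanged from the first line of output onward, and the genotype frequencies settle at p^2, 2pq, and q^2 after a single round of random mating.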

It  is  important  to  note  here  that  conventional  statistical  tests  for  Hardy- 
Weinberg proportions (such as the χ^2 test discussed below) are not very
sensitive to deviations from the expected genotype frequencies. Consequently,
the  mere  fact  that  observed  genotype  frequencies  may  happen  to  fit  the 
Hardy-Weinberg  proportions  cannot  be  taken  as  evidence  that  all  of  the 
assumptions  underlying  the  model  are  valid.  The  most  that  can  be  concluded 
is  that,  whatever  departures  from  the  assumptions  there  may  be,  they  are 
not  sufficiently  large  to  result  in  deviations  from  HWE  that  are  detectable 
with  conventional  statistical  tests. 

The  Hardy-Weinberg  Principle  in  Operation 

Application  of  the  Hardy-Weinberg  principle  can  be  illustrated  with  data  on 
the  MN  blood  groups  in  a British  population.  In  a sample  of  1000  people 
(Race and Sanger 1975), the observed phenotypes were 298 blood group M
(indicating genotype MM), 489 blood group MN (indicating genotype MN),
and  213  blood  group  N (indicating  genotype  NN).  To  determine  whether 
these  genotype  frequencies  are  in  accord  with  HWE,  the  allele  frequencies  of 
M and  N must  first  be  estimated.  The  estimated  allele  frequency  p of  M is 
1085/2000  = 0.5425  and  that  q of  N is  915/2000  = 0.4575.  (For  the  details,  see 
Problem  1.4  in  Chapter  1.)  Were  the  population  in  HWE,  we  would  expect  the 
genotype frequencies of MM, MN, and NN to be p^2, 2pq, and q^2, respectively,
where p and q are the allele frequencies in the underlying population from
which the sample was drawn. Because p and q are parameters, their true
values are unknown. However, in testing for HWE we can substitute the
estimated values to obtain the expected proportions MM: (0.5425)^2 = 0.2943, MN:
2(0.5425)(0.4575) = 0.4964, and NN: (0.4575)^2 = 0.2093, respectively. Because
the sample size is 1000, the expected numbers of the MM, MN, and NN
genotypes are 0.2943 × 1000 = 294.3, 0.4964 × 1000 = 496.4, and 0.2093 × 1000 =
209.3, respectively.

At this point, it is convenient to tabulate the data into three columns, the
first giving the genotypes, the second giving the observed numbers, and the
third giving the expected numbers:

Genotype    Observed    Expected
MM          298         294.3
MN          489         496.4
NN          213         209.3


With  the  data  so  arrayed,  it  is  evident  that  the  fit  between  the  observed 
numbers  and  the  expected  numbers,  though  not  perfect  because  of  chance 
statistical  fluctuations  in  the  number  of  each  genotype  that  may  be  included 
in  any  given  sample,  is  nevertheless  very  close.  To  verify  this  conclusion,  we 
will  apply  a conventional  statistical  test  to  the  data  in  order  to  assess  quanti- 
tatively the  closeness  of  fit.  A test  commonly  employed  in  population  genet- 
ics is  called  the  chi-square  test,  which  is  based  on  the  value  of  a number, 
called χ^2, calculated from the data as

$$\chi^2 = \sum \frac{(\mathrm{obs} - \mathrm{exp})^2}{\mathrm{exp}} \tag{3.3}$$


where  obs  refers  to  the  observed  number  in  any  genotypic  class,  exp  refers  to 
the expected number in the same genotypic class, and the Σ sign denotes that
the  values  are  to  be  summed  over  all  genotypic  classes.  In  the  case  at  hand, 


$$\chi^2 = \frac{(298 - 294.3)^2}{294.3} + \frac{(489 - 496.4)^2}{496.4} + \frac{(213 - 209.3)^2}{209.3} = 0.222$$

To be completely unambiguous, some statisticians prefer use of the symbol
X^2 for the realized value of the test statistic defined in Equation 3.3, in order
to distinguish between the test statistic and the true χ^2 distribution itself. The
distinction should certainly be kept in mind, but we will not recognize it
formally with different symbols.

Associated with any χ^2 value is a second number called the degrees of
freedom for that χ^2. In general, the number of degrees of freedom (df)
associated with a χ^2 value equals

df = Number of classes of data
     - Number of parameters estimated from the data
     - 1

In the MN example, there are three classes of data and one parameter (p)
estimated from the data, and so df = 3 - 1 - 1 = 1. Note that a degree of
freedom is not subtracted for estimating q because of the relation q = 1 - p; that is,
once p has been estimated, the estimate of q is automatically fixed, and so we
deduct just the one degree of freedom corresponding to p.

Calculation of χ^2 and its associated degrees of freedom is carried out in
order to obtain a number for assessing goodness of fit; the number is
determined from Figure 3.3. To use the chart, find the value of χ^2 along the
horizontal axis, then move vertically from this value until the line for the number
of degrees of freedom is intersected, then move horizontally from the point of
intersection to the vertical axis and read the corresponding probability value
P. In our case, with χ^2 = 0.222 and one degree of freedom, the corresponding
probability value is about P = 0.67. The probability associated with a
particular χ^2 test has the following interpretation: it is the probability that chance
alone could produce a deviation between the observed and expected values
at least as great as the deviation actually realized. Thus, if the probability is
large, it means that chance alone could account for the deviation, and it
strengthens our confidence in the validity of the model used to obtain the
expectations (in this case, the Hardy-Weinberg model). On the other hand, if
the probability associated with the χ^2 is small, it means that chance alone is
not likely to lead to a deviation as large as actually realized, and it
undermines our confidence in the validity of the model. Where exactly the cutoff
should  be  between  a "large"  probability  and  a "small"  one  is,  of  course,  not 
obvious,  but  there  is  an  established  guideline  to  follow.  If  the  probability  is 
less  than  0.05,  then  the  goodness  of  fit  is  considered  sufficiently  poor  that  the 
model  is  judged  invalid  for  the  data;  alternatively,  if  the  probability  is  greater 
than  0.05,  the  fit  is  considered  sufficiently  close  that  the  model  is  not  rejected. 
Because  the  probability  in  the  MN  example  is  0.67,  which  is  greater  than  0.05, 
we  have  no  reason  to  reject  the  hypothesis  that  the  genotype  frequencies  are 
in  Hardy-Weinberg  proportions  for  this  gene. 
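The whole test can also be carried out by machine. The sketch below is one way to do so under stated assumptions: the function names are ours, the expected numbers are kept unrounded (so the statistic may differ from the hand calculation in the third decimal place), and the probability is computed from the closed form erfc(sqrt(χ^2/2)), which holds only for one degree of freedom.

```python
import math


def hwe_chi_square(n_hom1, n_het, n_hom2):
    """Chi-square goodness of fit to Hardy-Weinberg proportions for
    two codominant alleles, given the three observed genotype counts."""
    n = n_hom1 + n_het + n_hom2
    p = (2 * n_hom1 + n_het) / (2 * n)   # allele frequency estimated from the data
    q = 1 - p
    observed = [n_hom1, n_het, n_hom2]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2


def p_value_df1(chi2):
    """Upper-tail probability of the chi-square distribution with 1 df only."""
    return math.erfc(math.sqrt(chi2 / 2))


# MN blood group data: 298 MM, 489 MN, 213 NN.
p, expected, chi2 = hwe_chi_square(298, 489, 213)
print("p =", round(p, 4))
print("expected =", [round(e, 1) for e in expected])
print("chi-square =", round(chi2, 3), " P =", round(p_value_df1(chi2), 2))
```

For the MN data the computed probability comes out near 0.64, slightly below the value read from the chart; the difference is well within the precision of a graphical reading and does not affect the conclusion.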


Figure 3.3 Graph of χ^2. To use the graph, find the value of χ^2 along the
horizontal axis, then read the probability value for the appropriate number of
degrees of freedom from the vertical axis, which gives the probability of a fit
as bad or worse by chance (P). (From Hartl 1994.)


PROBLEM  3.4  In  the  Ss  blood  group,  related  to  the  MN  system, 
three  phenotypes  corresponding  to  the  genotypes  SS,  Ss,  and  ss  can 
be  identified  by  appropriate  reagents.  Among  the  same  1000  British 
people who gave the MN data above, the observed numbers of each
genotype for the Ss blood groups were 99 SS, 418 Ss, and 483 ss.
Estimate the allele frequency of S (p) and s (q) and carry out a χ^2 test of
goodness  of  fit  between  the  observed  genotype  frequencies  and  their 
Hardy-Weinberg  expectations.  Is  there  any  reason  to  reject  the 
hypothesis  of  Hardy-Weinberg  proportions  for  this  gene? 


ANSWER  p = 0.308  and  q = 0.692.  The  expected  numbers  of  SS,  Ss, 
and ss are 94.86, 426.27, and 478.86, respectively. The χ^2 = 0.377 with
one  degree  of  freedom.  The  associated  probability  from  Figure  3.3  is 
about  0.55,  so  there  is  no  reason  to  reject  the  hypothesis  of  HWE. 


Complications  of  Dominance 

Dominance  obscures  the  one-to-one  relation  between  phenotype  and  geno- 
type, but  the  allele  frequencies  can  still  be  estimated  if  one  is  willing  to 
assume  HWE.  For  a polymorphic  gene  with  two  alleles  in  which  one  of  the 
alleles  is  dominant,  only  two  phenotypic  classes  can  be  distinguished — the 
dominant  phenotype  and  the  recessive  phenotype.  An  example  is  the  D allele 
in  the  human  Rh  blood  groups,  which  codes  for  an  Rh+  antigen  present  on 
the surface of red blood cells. An alternative allele, designated d, fails to code
for the antigen. The allele D is dominant over d because both DD and Dd
genotypes produce the Rh+ antigen. The genotypes DD and Dd therefore
have the Rh+ phenotype and are said to be Rh positive; the dd genotype has
the phenotype Rh- and is said to be Rh negative. At the molecular level, the
Dd  genotype  might  be  expected  to  produce  only  half  as  much  antigen  as  DD 
because  it  contains  only  one  D allele,  but  the  phenotype  is  nevertheless  Rh 
positive. 

Among  American  Caucasians,  the  frequency  of  Rh+  is  about  85.8%  and 
the frequency of Rh- is about 14.2% (Mourant et al. 1976). Given only the
phenotype frequencies, the data cannot be used to calculate the genotype
frequencies because we have no way of knowing what proportion of Rh+
phenotypes are DD and what proportion are Dd. However, if we are willing to
assume random mating, then the relative proportions of DD and Dd genotypes
are given by the Hardy-Weinberg principle. Assuming random mating and
HWE, the genotype frequencies are given by p^2, 2pq, and q^2, where p is the
allele frequency of D. An estimate of q can therefore be obtained by setting q^2
= 0.142 (the frequency of the homozygous recessive phenotype), and so q =
√0.142 = 0.3768. More generally, if R is the frequency of homozygous
recessive genotypes found in a sample of n organisms, then q and its standard error
are estimated as

$$\hat{q} = \sqrt{R}, \qquad \mathrm{SE}(\hat{q}) = \sqrt{\frac{1 - R}{4n}} \tag{3.4}$$


With q estimated from Equation 3.4 as 0.3768, then p = 1 - 0.3768 =
0.6232, and the frequencies of DD, Dd, and dd are expected to be p^2 = (0.6232)^2
= 0.3884, 2pq = 2(0.6232)(0.3768) = 0.4696, and q^2 = (0.3768)^2 = 0.1420,
respectively. The proportion of Rh+ people that are actually heterozygous is
therefore 0.4696/(0.4696 + 0.3884) = 54.7%. However, when there is dominance,
there is no possibility for a χ^2 test of goodness of fit to HWE because there are
0 degrees of freedom. The lack of degrees of freedom is the reason why the
calculated frequencies of Rh+ and Rh- (0.3884 + 0.4696 = 0.858 and 0.142,
respectively) fit the observed frequencies exactly.
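The estimation under dominance is easily scripted. The following sketch assumes random mating throughout; the function names are ours, and the standard error uses the expression sqrt((1 - R)/4n) that is applied in the worked answers of this section.

```python
import math


def recessive_allele_estimate(R, n=None):
    """Estimate q from the frequency R of homozygous recessives, assuming HWE.
    If the sample size n is given, also return the standard error of q."""
    q = math.sqrt(R)
    se = math.sqrt((1 - R) / (4 * n)) if n is not None else None
    return q, se


def hwe_genotypes(q):
    """Expected frequencies of the dominant homozygote, heterozygote,
    and recessive homozygote for recessive allele frequency q."""
    p = 1 - q
    return p * p, 2 * p * q, q * q


# Rh blood groups: 14.2% Rh negative (sample size not used here).
q, _ = recessive_allele_estimate(0.142)
DD, Dd, dd = hwe_genotypes(q)
print("q =", round(q, 4))
print("DD, Dd, dd =", round(DD, 4), round(Dd, 4), round(dd, 4))
print("heterozygous fraction of Rh+ =", round(Dd / (DD + Dd), 3))
```

Because the estimate of q is forced to reproduce the observed recessive frequency, the expected phenotype frequencies necessarily match the data, which is the computational face of having no degrees of freedom left for a test.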


PROBLEM  3.5  The  Basque  people,  who  live  in  the  Pyrenees  moun- 
tains between  France  and  Spain,  have  one  of  the  highest  frequencies 
of  the  d allele  in  the  Rh  system  so  far  reported.  In  one  study  of  400 
Basques, 230 were found to be Rh+ and 170 Rh- (Mourant et al. 1976).
Estimate the frequencies of the D and d alleles, the genotype
frequencies, and the proportion of Rh+ people who are heterozygous Dd.
What is the standard error of the estimate of q?


ANSWER: q = √(170/400) = 0.65, p = 0.35, and the estimated
genotype frequencies of DD, Dd, and dd are 0.121, 0.454, and 0.425,
respectively. The proportion of Dd among Rh+ phenotypes in the Basque
population is 0.454/(0.121 + 0.454) = 79%. The standard error of q
equals √[(1 - 0.425)/1600] = 0.02.


The  Hardy-Weinberg  principle  also  finds  application  in  studies  of  industri- 
al melanism,  one  of  the  most  famous  and  best-studied  cases  of  evolution  in 
action  (Kettlewell  1973).  Industrial  melanism  refers  to  the  evolution  of  black 
(melanic) color patterns in several species of moths that accompanied
progressive pollution of the environment by coal soot during the industrial revolution.
(The various color forms of the moths are known as morphs.) The evolution of
melanism  has  been  observed  in  Great  Britain,  West  Germany,  Eastern  Europe, 
the  United  States,  and  in  other  heavily  industrialized  areas.  The  species  that 
evolve  melanism  are  typically  large  moths  that  fly  by  night  and  rest  in  a sort  of 
cataleptic  state  by  day,  often  on  the  trunks  of  trees,  using  their  cryptic  black- 
and-white  mottled  color  pattern  for  concealment  from  visually  cued  predators 
such  as  hedge  sparrows,  redstarts,  and  robins  (Figure  3.4).  Of  nearly  800 
species  of  large  moths  in  the  British  Isles,  where  industrial  melanism  has  been 
most  intensively  studied,  about  100  species  are  industrial  melanics  (Bishop  and 
Cook 1975). The best known of these are the peppered moth (Biston betularia)
and the scalloped hazel moth (Gonodontis bidentata). In most instances, the
melanic  color  pattern  has  been  found  to  be  due  to  a single  dominant  allele. 


PROBLEM  3.6  In  one  study  of  a heavily  polluted  area  near  Birm- 
ingham, England,  Kettlewell  (1956)  observed  a frequency  of  87% 
melanic Biston betularia. Estimate the frequency of the dominant allele
leading  to  melanism  in  this  population  and  the  frequency  of  melan- 
ics that  are  heterozygous. 


Figure 3.4 Melanic and nonmelanic moths, showing camouflage of light moths
on light background and dark moths on dark. (Photograph by H. B. D. Kettlewell.)

ANSWER The observed frequency of homozygous recessives is R =
0.13, and so the frequency of the recessive allele is estimated as q =
√0.13 = 0.36. Assuming random mating, the expected frequencies of
dominant homozygotes, heterozygotes, and recessive homozygotes
are 0.41, 0.46, and 0.13, respectively. The proportion of melanics that
are heterozygous is 0.46/0.87 = 52.9%.


Frequency  of  Heterozygotes 

The  Hardy-Weinberg  principle  also  has  important  implications  for  the  fre- 
quency of  heterozygotes  carrying  rare  recessive  alleles.  The  graphs  in  Figure 
3.5  depict  the  frequencies  of  AA,  Aa,  and  aa  in  a population  in  HWE.  The  het- 
erozygotes are  most  frequent  when  the  allele  frequencies  are  0.5.  Suppose 
that  the  allele  a is  a recessive,  and  consider  the  curves  as  the  allele  frequency 
of  a goes  toward  0.  As  a becomes  rare,  the  frequencies  of  recessive  homozy- 
gotes and  heterozygotes  both  decrease,  but  the  frequency  of  the  recessive 
homozygote  is  much  lower.  As  the  frequency  of  a goes  to  0,  the  frequency  of 
recessive homozygotes goes to 0 at a rate of q^2, whereas the frequency of
heterozygotes goes to 0 at a rate of 2pq. The result is that the ratio of
heterozygotes to recessive homozygotes increases without limit as the recessive allele
becomes rare.

Figure 3.5 Frequencies of AA, Aa, and aa genotypes with HWE. Note that, as
either allele becomes more rare, the frequency of homozygotes for that allele is
much lower than the frequency of heterozygotes.

To illustrate the principle, suppose q = 0.10; then 2pq/q^2 = 18, meaning
that  there  are  18  times  as  many  heterozygotes  as  recessive  homozygotes.  For 
q = 0.01,  to  take  a more  extreme  example,  the  ratio  is  198;  and  for  q = 0.001, 
the  ratio  is  1998.  These  examples  demonstrate  that  when  a recessive  allele  is 
rare,  most  genotypes  containing  the  rare  allele  are  heterozygous. 

Quantitatively, the ratio of heterozygotes to homozygotes equals 2pq/q^2 =
2p/q = 2/q - 2, which, for small q, is approximately 2/q. Consequently, the excess of
heterozygotes  over  homozygotes  becomes  progressively  greater  as  the  reces- 
sive allele  becomes  more  rare.  To  take  a real  example,  consider  cystic  fibro- 
sis, an  autosomal-recessive  defect  in  chloride  transport  characterized  by 
abnormal  glandular  secretions,  impaired  digestion,  frequent  respiratory 
infections,  and  other  serious  symptoms.  The  frequency  of  the  homozygous 
recessive  genotype  in  newborn  Caucasians  is  approximately  1 in  1700. 
For this allele, q = √(1/1700) = 0.024. Assuming random mating, the
frequency of heterozygotes is estimated as 2(0.024)(1 - 0.024) = 0.047, or about 1
in  21.  In  other  words,  although  only  1 person  in  1700  is  actually  affected  with 
cystic  fibrosis,  1 person  in  21  is  a heterozygous  carrier  of  the  harmful  allele. 
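These calculations are short enough to script. The sketch below, with function names of our own choosing, assumes random mating and a fully recessive condition; it reproduces the ratios for q = 0.10, 0.01, and 0.001 and the carrier estimate for cystic fibrosis.

```python
import math


def het_to_hom_ratio(q):
    """Ratio of heterozygotes to recessive homozygotes: 2pq / q^2 = 2(1 - q) / q."""
    return 2 * (1 - q) / q


def carrier_frequency(incidence):
    """Estimate q and the heterozygote frequency 2pq from the frequency of
    affected (homozygous recessive) individuals, assuming random mating."""
    q = math.sqrt(incidence)
    return q, 2 * (1 - q) * q


for q in (0.10, 0.01, 0.001):
    print("q =", q, " heterozygotes per recessive homozygote:", round(het_to_hom_ratio(q)))

q, carriers = carrier_frequency(1 / 1700)
print("cystic fibrosis: q =", round(q, 3), " carriers =", round(carriers, 3),
      " or about 1 in", round(1 / carriers))
```

The same two functions apply to any rare recessive condition for which only the frequency of affected individuals is known.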


PROBLEM  3.7  Phenylketonuria  is  a defect  in  phenylalanine  metab- 
olism caused  by  lack  of  a functioning  allele.  Over  200  defective  alle- 
les have  been  identified  and  most  affected  individuals  are  actually 
heterozygous  for  two  different  defective  alleles.  The  condition  affects 
about  1 in  10,000  newborn  Caucasians.  Estimate  the  frequency  of  het- 
erozygotes for  the  normal  and  a defective  allele  under  the  assumption 
of  random  mating. 


ANSWER  About  1 person  in  50  carries  a defective  allele. 


SPECIAL  CASES  OF  RANDOM  MATING 

In  this  section  we  extend  the  Hardy-Weinberg  principle  to  multiple  alleles 
and  to  genes  located  on  the  X chromosome. 

Three  or  More  Alleles 

Genotype  frequencies  under  random  mating  for  genes  with  three  alleles  are 
shown in Figure 3.6. Here it is convenient to label the alleles as A_1, A_2, and A_3
and the corresponding allele frequencies as p_1, p_2, and p_3. Because there are
only three alleles, p_1 + p_2 + p_3 = 1. With three alleles there are six diploid
genotypes, and under random mating their expected frequencies are as follows:

Genotype    Frequency
A_1A_1      p_1^2
A_1A_2      2p_1p_2
A_2A_2      p_2^2
A_1A_3      2p_1p_3
A_2A_3      2p_2p_3
A_3A_3      p_3^2

These frequencies can be obtained by expanding (p_1A_1 + p_2A_2 + p_3A_3)^2,
which the cross-multiplication square in Figure 3.6 does automatically.

Figure 3.6 Cross-multiplication square showing Hardy-Weinberg frequencies
for three autosomal alleles.

Application  of  Figure  3.6  can  be  illustrated  with  the  familiar  ABO  blood 
groups in humans. The ABO blood groups are controlled by three alleles
designated I^O, I^A, and I^B. Genotypes I^A I^A and I^A I^O have blood type A; genotypes
I^B I^B and I^B I^O have blood type B, genotype I^O I^O has blood type O, and
genotype I^A I^B has blood type AB. In one test of 6313 Caucasians in Iowa City, the
number of people with blood types A, B, O, and AB was found to be 2625,
570, 2892, and 226, respectively (Mourant et al. 1976). The best estimates
of allele frequency in this case are p_1 = 0.2593 (for I^A), p_2 = 0.0652 (for I^B), and
p_3 = 0.6755 (for I^O). (Estimation of allele frequencies for the ABO blood groups
is complicated because of dominance; for methods see Cavalli-Sforza and
Bodmer 1971 and Vogel and Motulsky 1986.) The expected (and observed)
numbers of the four blood-type phenotypes are therefore:


Blood type    Expected number                                         Observed
A             (0.2593^2 + 2 × 0.2593 × 0.6755) × 6313 = 2636.0        2625
B             (0.0652^2 + 2 × 0.0652 × 0.6755) × 6313 = 582.9         570
O             0.6755^2 × 6313 = 2880.6                                2892
AB            (2 × 0.2593 × 0.0652) × 6313 = 213.5                    226

The χ^2 for goodness of fit to Hardy-Weinberg proportions is 1.11. There is
one degree of freedom for this test: 4 (to start with) - 1 (for fixing the total at
6313) - 1 (for estimating p_1 from the data) - 1 (for estimating p_2 from the
data); a degree of freedom is not deducted for estimating p_3 because p_3 = 1 -
p_1 - p_2. For a χ^2 of 1.11 with one degree of freedom, the associated
probability from Figure 3.3 is about 0.30, and so the Iowa City population gives no
evidence against Hardy-Weinberg proportions for this gene.
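The ABO calculation can be reproduced with a few lines of code. This is a sketch under stated assumptions: it takes the allele frequencies as given (0.2593, 0.0652, and 0.6755, the values used in the tabulated calculations) rather than estimating them, the function name is ours, and the probability uses the closed form for one degree of freedom only.

```python
import math


def abo_expected(p_a, p_b, p_o, n):
    """Expected numbers of the four ABO phenotypes under random mating."""
    return {
        "A": (p_a ** 2 + 2 * p_a * p_o) * n,   # genotypes AA and AO
        "B": (p_b ** 2 + 2 * p_b * p_o) * n,   # genotypes BB and BO
        "O": p_o ** 2 * n,                     # genotype OO
        "AB": 2 * p_a * p_b * n,               # genotype AB
    }


observed = {"A": 2625, "B": 570, "O": 2892, "AB": 226}
n = sum(observed.values())
expected = abo_expected(0.2593, 0.0652, 0.6755, n)
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
p_value = math.erfc(math.sqrt(chi2 / 2))       # valid for 1 degree of freedom

for k in observed:
    print(k, observed[k], round(expected[k], 1))
print("chi-square =", round(chi2, 2), " P =", round(p_value, 2))
```

Note that the number of degrees of freedom is not computed by the code; it must be worked out from the number of classes and estimated parameters, as in the text.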


PROBLEM  3.8  In  a sample  of  1617  Spanish  Basques,  the  numbers  of 
A,  B,  O,  and  AB  blood  types  observed  were  724,  110,  763,  and  20, 
respectively (Mourant et al. 1976). The best estimates of allele
frequency are p_1 = 0.2661 (for I^A), p_2 = 0.0411 (for I^B), and p_3 = 0.6928 (for
I^O). Calculate the expected numbers of the four phenotypes and carry
out a χ^2 test for goodness of fit to the Hardy-Weinberg expectations.


ANSWER  The  expected  numbers  of  A,  B,  O,  and  AB  are  710.7,  94.8, 
776.1, and 35.4, respectively. The χ^2 equals 9.61 with one degree of
freedom, for which the corresponding probability is 0.0025. Because
a deviation as large as or larger than that observed would be expected by
chance in only 0.0025 of samples (that is, about 1 in 400), there is very
good reason to reject the hypothesis that the genotypes are in Hardy-
Weinberg proportions in this population. The reason for the
discrepancy is not known. One likely possibility is migration into the
population by people with allele frequencies that are significantly different
from those among the Basques themselves.


PROBLEM  3.9  Among  many  aboriginal  American  Indian  tribes,  the 
allele  frequency  of  IB  is  extremely  low.  For  example,  a sample  of  600 
Papago  Indians  from  Arizona  included  37  A and  563  O blood  types 
(Mourant et al. 1976). What are the best estimates of the allele
frequencies of I^A, I^B, and I^O in this population, and what are the
expected genotype frequencies assuming random mating?


ANSWER There are no I^B alleles in the sample, so the best estimate
of p_2 is 0. Thus, there are only two alleles, I^A and I^O, with I^A dominant.
The best estimate of p_3 is thus obtained from Equation 3.4 as
√(563/600) = 0.9687 and that of p_1 as 1 - p_3 = 0.0313. The expected
genotype frequencies are 0.0313^2 = 0.0010 for I^A I^A, 2(0.0313)(0.9687) =
0.0606 for I^A I^O, and 0.9687^2 = 0.9384 for I^O I^O.


In  general,  if  there  are  n alleles 


A_1, A_2, ..., A_n

with respective frequencies

p_1, p_2, ..., p_n

(and p_1 + p_2 + ... + p_n = 1), then the genotype frequencies expected under
random mating are

$$\begin{aligned} p_i^2 &\quad \text{for } A_iA_i \text{ homozygotes} \\ 2p_ip_j &\quad \text{for } A_iA_j \text{ heterozygotes} \end{aligned} \tag{3.5}$$

Equation  3.5  may  be  applied  to  data  on  allozyme  polymorphisms  in 
Drosophila  persimilis  in  California.  One  sample  of  108  adult  flies  from  the  Fish 
Creek population included four alleles of the gene Xdh, which codes for
xanthine dehydrogenase. We may call the alleles Xdh-1, Xdh-2, Xdh-3, and
Xdh-4; their respective frequencies were estimated as p_1 = 0.08, p_2 = 0.21, p_3 =
0.62, and p_4 = 0.09 (Prakash 1977). With four alleles, there are four possible
homozygotes (for example, Xdh-1/Xdh-1) and six possible heterozygotes (for
example, Xdh-1/Xdh-2). In a random-mating population, the frequency of any
homozygous genotype is expected to be the square of the corresponding
allele frequency; for example, the frequency of Xdh-1/Xdh-1 is expected to be
p_1^2. The frequency of any heterozygous genotype is expected to be two
times the product of the corresponding allele frequencies; for example, the
frequency of Xdh-1/Xdh-2 is expected to be 2p_1p_2. The Hardy-Weinberg
frequencies for all 10 possible genotypes can be obtained by expanding the
expression (0.08 Xdh-1 + 0.21 Xdh-2 + 0.62 Xdh-3 + 0.09 Xdh-4)^2.
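The expansion for any number of alleles is tedious by hand but simple to automate. The sketch below, with a function name of our own, enumerates every unordered pair of alleles and applies the rule of p_i^2 for homozygotes and 2p_ip_j for heterozygotes, using the Xdh frequencies as input.

```python
from itertools import combinations_with_replacement


def hwe_genotype_frequencies(allele_freqs):
    """Expected genotype frequencies under random mating for any number of
    alleles: p_i^2 for homozygotes and 2 * p_i * p_j for heterozygotes."""
    freqs = {}
    for a, b in combinations_with_replacement(allele_freqs, 2):
        pa, pb = allele_freqs[a], allele_freqs[b]
        freqs[(a, b)] = pa * pa if a == b else 2 * pa * pb
    return freqs


xdh = {"Xdh-1": 0.08, "Xdh-2": 0.21, "Xdh-3": 0.62, "Xdh-4": 0.09}
genotypes = hwe_genotype_frequencies(xdh)
for (a, b), f in genotypes.items():
    print(f"{a}/{b}: {f:.4f}")
print("number of genotypes:", len(genotypes))
print("sum of frequencies:", round(sum(genotypes.values()), 10))
```

With n alleles the function returns n(n + 1)/2 genotypes, and the frequencies sum to 1 whenever the allele frequencies do.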


PROBLEM  3.10  Four  alleles  of  the  gene  Adh  coding  for  alcohol 
dehydrogenase  were  found  in  a Texas  population  of  Phlox  cuspidata 
(Levin  1978).  The  alleles  may  be  designated  Adh-1,  Adh-2,  Adh-3,  and 
Adh-4.  Their  frequencies  were  estimated  as  0.11,  0.84,  0.01,  and  0.04, 
respectively.  What  are  the  expected  Hardy-Weinberg  proportions  of 
the  10  genotypes? 


ANSWER Adh-1/Adh-1: 0.11^2 = 0.0121; Adh-1/Adh-2: 2(0.11)(0.84) =
0.1848; Adh-2/Adh-2: 0.84^2 = 0.7056; Adh-1/Adh-3: 2(0.11)(0.01) =
0.0022; Adh-2/Adh-3: 2(0.84)(0.01) = 0.0168; Adh-3/Adh-3: 0.01^2
= 0.0001; Adh-1/Adh-4: 2(0.11)(0.04) = 0.0088; Adh-2/Adh-4:
2(0.84)(0.04) = 0.0672; Adh-3/Adh-4: 2(0.01)(0.04) = 0.0008; Adh-4/Adh-4:
0.04^2 = 0.0016. It should be pointed out that the observed genotype
frequencies  were  nowhere  near  the  Hardy-Weinberg  expectations 
because  Phlox  cuspidata  undergoes  a substantial  frequency  of  self- 
fertilization  (about  78%),  which  violates  the  assumption  of  random 
mating.  How  to  deal  with  such  departures  from  random  mating  is 
discussed  in  Chapter  4. 


X-Linked  Genes 

An  important  exception  to  the  rule  that  diploid  organisms  contain  two  alleles 
of  every  gene  applies  to  genes  on  the  X and  Y chromosomes.  In  mammals 
and  many  insects,  females  have  two  copies  of  the  X chromosome  whereas 
males have one X chromosome and one Y chromosome. The X and Y
chromosomes segregate, and so half the sperm from a male carry the X
chromosome and half carry the Y chromosome. Although the Y chromosome
carries very few genes other than those involved in the determination of sex and
male  fertility,  the  X chromosome  carries  as  full  a complement  of  genes  as 
any  other  chromosome.  Genes  on  the  X chromosome  are  called  X-linked 
genes,  and  the  important  consequence  of  X linkage  is  that  a recessive  allele 
on  the  X chromosome  in  a male  is  expressed  phenotypically  because  the  Y 
chromosome  lacks  any  compensating  allele.  For  X-linked  genes  with  two 
alleles, therefore, there are three female genotypes (AA, Aa, and aa) but only
two male genotypes (A and a).

The  consequences  of  random  mating  with  two  X-linked  alleles  are  shown 
in Figure 3.7, where the alleles are denoted X^A and X^a. Note that in females,
which have two X chromosomes, the genotype frequencies are as given by
the Hardy-Weinberg principle in Equation 3.1; in males, which have only one
X chromosome, the genotype frequencies are equal to the allele frequencies.
The calculations in Figure 3.7 are valid only if the allele frequencies are
identical in eggs and sperm. When they differ, approximate equality of allele
frequencies in the sexes is usually attained for X-linked genes in a period of 10
or so generations of random mating because, in each generation, any allele
frequency in female zygotes is the average of the frequency of the allele in
male and female parents in the previous generation.

Figure 3.7 Consequences of random mating with X-linked genes. Genotype
frequencies in females equal the Hardy-Weinberg frequencies, and genotype
frequencies in males equal the allele frequencies.
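The approach to equal allele frequencies in the two sexes can be traced with a short iteration. The sketch below rests on two rules: the female frequency in each generation is the average of the parental male and female frequencies, as stated above, and the male frequency equals the frequency in the female parents, which we assume here because a son receives his only X chromosome from his mother. The starting values are hypothetical and deliberately extreme.

```python
def x_linked_generation(p_female, p_male):
    """Allele frequencies among female and male zygotes in the next generation."""
    next_female = (p_female + p_male) / 2   # one X from each parent
    next_male = p_female                    # single X inherited from the mother
    return next_female, next_male


# Hypothetical start: allele fixed in females and absent in males.
p_f, p_m = 1.0, 0.0
for generation in range(11):
    print(generation, round(p_f, 4), round(p_m, 4), "difference =", round(p_f - p_m, 4))
    p_f, p_m = x_linked_generation(p_f, p_m)
```

Under these rules the difference between the sexes is halved and reversed in sign each generation, so it falls below 0.001 within 10 generations even from this extreme start, while the weighted average (2/3 female plus 1/3 male) stays constant.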


PROBLEM  3.11  The  human  Xg  blood  group  is  controlled  by  an  X- 
linked gene with two alleles, designated Xg^a and Xg. Two phenotypes
can be distinguished by means of the appropriate antisera, Xg(a+) and
Xg(a-). Xg^a is dominant to Xg, and so females of genotype Xg^a/Xg^a
and Xg^a/Xg have blood type Xg(a+), whereas females of genotype
Xg/Xg are phenotypically Xg(a-). Males of genotype Xg^a have blood
type Xg(a+); those of genotype Xg have blood type Xg(a-). In a
sample of 2082 British people, there were 967 Xg(a+) females, 667 Xg(a+)
males, 102 Xg(a-) females, and 346 Xg(a-) males (Race and Sanger
1975). The best estimates of allele frequency are p = 0.675 (for Xg^a)
and q = 0.325 (for Xg). Calculate the expected numbers in the four
phenotypic classes, assuming random-mating proportions, and carry
out a χ^2 test for goodness of fit. (The number of degrees of freedom in
this  case  is  1:  there  are  four  degrees  of  freedom  to  start  with;  one  must 
be  deducted  for  using  the  observed  number  of  males  in  calculating 
the  expectations  for  males;  one  must  be  deducted  for  using  the 
observed  number  of  females  in  calculating  their  expectations;  and  one 
more  must  be  deducted  for  estimating  p from  the  data.) 


ANSWER The expected numbers of Xg(a+) and Xg(a-) males are
0.675 × 1013 = 683.8 and 0.325 × 1013 = 329.2, respectively. The
expected numbers of Xg(a+) and Xg(a-) females are [0.675^2 + 2(0.675)(0.325)]
× 1069 = 956.1 and 0.325^2 × 1069 = 112.9, respectively. The χ^2 equals
2.45 which, as noted above, has one degree of freedom. The
associated probability is about 0.12 (Figure 3.3), and so there is no reason to
reject the hypothesis of random-mating proportions.


One of the important features of random mating for X-linked genes is that
phenotypes resulting from a recessive allele will be more common in males
than in females. In Problem 3.11, for example, the proportion of Xg(a-) males
is 346/1013 = 34%, whereas the proportion of Xg(a-) females is only
102/1069 = 10%. There is always an excess of affected males because q (which
equals the proportion of males with the recessive phenotype) will always be
greater than $q^2$ (which is the proportion of females with the recessive
phenotype). Indeed, the discrepancy grows larger as the recessive allele becomes
more rare. For example, with the X-linked "green" type of color blindness, q
= 0.05 in Western Europeans, and so the ratio of affected males to affected
females is $q/q^2 = 1/q$ = 1/0.05 = 20. In contrast, for the X-linked "red" type
of color blindness, q = 0.01 and so, in this case, the ratio of affected males to
affected females is 1/0.01 = 100.
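
How the male excess grows as the recessive allele becomes rarer can be tabulated directly. The following sketch assumes random mating; the function name is hypothetical. It uses the three values of q mentioned in the text.

```python
# Expected frequency of an X-linked recessive phenotype in each sex
# under random mating. The function name is illustrative.

def x_linked_recessive(q):
    """Return (affected males, affected females, male-to-female ratio)."""
    males = q            # males carry a single X chromosome
    females = q ** 2     # females must be homozygous for the recessive allele
    return males, females, males / females

for q in (0.325, 0.05, 0.01):
    males, females, ratio = x_linked_recessive(q)
    print(q, males, females, ratio)
```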


PROBLEM 3.12 California populations of Drosophila persimilis have
two alleles of an X-linked gene coding for allozymes of
phosphoglucomutase-1 (Policansky and Zouros 1977). The alleles may be
designated $Pgm\text{-}1^A$ and $Pgm\text{-}1^B$; their estimated frequencies were 0.25 and
0.75, respectively. Assuming random-mating proportions, what are
the expected genotype frequencies in males and females?


ANSWER In males, $Pgm\text{-}1^A$ at 0.25 and $Pgm\text{-}1^B$ at 0.75. In females,
$Pgm\text{-}1^A/Pgm\text{-}1^A$ at $0.25^2$ = 0.0625; $Pgm\text{-}1^A/Pgm\text{-}1^B$ at 2(0.25)(0.75) =
0.3750; $Pgm\text{-}1^B/Pgm\text{-}1^B$ at $0.75^2$ = 0.5625.


Before  leaving  the  subject  of  X-linkage,  it  is  necessary  to  point  out  that 
certain  species — among  them,  birds,  moths,  and  butterflies — have  the  sex- 
chromosome  situation  backwards.  In  these  species,  females  are  XY  and  males 
XX.  The  consequences  of  random  mating  are  the  same  as  otherwise,  except 
that  the  sexes  are  reversed. 

LINKAGE  AND  LINKAGE  DISEQUILIBRIUM 

With random mating, the alleles of any gene are combined at random into
genotypes according to frequencies given by the Hardy-Weinberg proportions.
To be specific, imagine a gene with two alleles, call them $A_1$ and $A_2$, at
frequencies $p_1$ and $p_2$, respectively, where $p_1 + p_2 = 1$. Then the
Hardy-Weinberg principle tells us that genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are
expected in the proportions $p_1^2$, $2p_1p_2$, and $p_2^2$, respectively, provided that
mating is random.

Similarly, we may consider a different gene with alleles $B_1$ and $B_2$ at
frequencies $q_1$ and $q_2$, respectively, where $q_1 + q_2 = 1$. Then the Hardy-Weinberg
principle tells us again that the genotype frequencies of $B_1B_1$, $B_1B_2$, and $B_2B_2$
are expected in the proportions $q_1^2$, $2q_1q_2$, and $q_2^2$, respectively, provided that
mating is random. Thus, the $A_1$ allele is in random association with the $A_2$
allele, and the $B_1$ allele is in random association with the $B_2$ allele. Strange as
it may seem, the alleles of the A gene may nevertheless fail to be in random
association with the alleles of the B gene. The precise meaning of "random
association" is illustrated in Figure 3.8. In this figure the squares refer to the
alleles present in gametes, not to genotypes as in earlier diagrams. When the
alleles of the genes are in random association, the frequency of a gamete
carrying any particular combination of alleles equals the product of the
frequencies of those alleles. Genes that are in random association are said to be
in a state of linkage equilibrium, and genes not in random association are
said to be in linkage disequilibrium. With linkage equilibrium, therefore,
the gametic frequencies are:


$A_1B_1$:  $p_1 \times q_1$
$A_1B_2$:  $p_1 \times q_2$
$A_2B_1$:  $p_2 \times q_1$
$A_2B_2$:  $p_2 \times q_2$

With random mating and the other simplifying assumptions listed earlier
(including a large population with no mutation, migration, or selection),
linkage equilibrium between genes is eventually attained. However, linkage
equilibrium is attained gradually, and the rate of approach can be very slow.
The slow approach to linkage equilibrium stands in contrast to the attainment
of HWE with alleles of a single gene, which typically requires just one
generation (when generations are nonoverlapping) or a relatively small number
of generations (when generations are overlapping).

The rate of approach to linkage equilibrium depends on the rate of
recombination in genotypes heterozygous for both genes. There are two types of
double heterozygotes:

$A_1B_1/A_2B_2$
$A_1B_2/A_2B_1$

In the first case, the genotype was formed by the union of an $A_1B_1$ gamete
with an $A_2B_2$ gamete. In the second case, the genotype was formed by the
union of an $A_1B_2$ gamete with an $A_2B_1$ gamete. For the moment, consider the
genotype $A_1B_1/A_2B_2$. The gametes produced by this genotype are of four
types: (1) $A_1B_1$, (2) $A_2B_2$, (3) $A_1B_2$, and (4) $A_2B_1$. Gametic types 1 and 2 are
known as nonrecombinant gametes because the alleles are associated in the
same manner as in the previous generation (specifically, $A_1$ with $B_1$ and $A_2$
with $B_2$). Gametic types 3 and 4 are known as recombinant gametes because
the alleles are associated differently than in the previous generation
(specifically, $A_1$ with $B_2$ and $A_2$ with $B_1$).


                             Alleles of A gene
                             $A_1$ (frequency $p_1$)    $A_2$ (frequency $p_2$)
Alleles of B gene
$B_1$ (frequency $q_1$)      $A_1B_1$: $p_1q_1$         $A_2B_1$: $p_2q_1$
$B_2$ (frequency $q_2$)      $A_1B_2$: $p_1q_2$         $A_2B_2$: $p_2q_2$

Figure 3.8 Random association between two alleles of each of two genes,
showing expected gametic frequencies when the alleles are in linkage
equilibrium.

Because of Mendelian segregation, the frequency of gametic type 1 equals
that of type 2, and the frequency of gametic type 3 equals that of type 4. That
is, the two nonrecombinant gametes are formed in equal frequencies, and the
two recombinant gametes are formed in equal frequencies. However, the
overall frequency of recombinant gametes (type 3 + type 4) does not
necessarily equal the overall frequency of nonrecombinant gametes (type 1 + type
2) except in special cases. The term recombination fraction, usually
symbolized r, refers to the proportion of recombinant gametes produced by a double
heterozygote. Suppose, for example, that the genotype $A_1B_1/A_2B_2$ produces
gametes $A_1B_1$, $A_2B_2$, $A_1B_2$, and $A_2B_1$ in the proportions 0.38, 0.38, 0.12, and
0.12, respectively. Then the recombination fraction between the genes is r =
0.12 + 0.12 = 0.24.

The recombination fraction between genes depends on whether they are
present on the same chromosome and, if so, on the physical distance between
them. For genes on different chromosomes, the recombination fraction is r =
0.5 because the four possible gametic types are produced in equal frequency.
For genes on the same chromosome, the recombination fraction depends on
their distance apart, because each chromosome aligns side-by-side with its
partner chromosome in meiosis and can undergo a sort of breakage and
reunion resulting in an exchange of parts between the partner chromosomes.
The closer two genes are, the less likely that a breakage and reunion takes
place in the region between the genes; the farther apart two genes are, the
more likely such an event becomes. The smallest possible recombination
fraction is r = 0, which would imply that the two genes are so close together that
a break never takes place between them. The largest possible recombination
fraction is r = 0.5, which is found when genes are very far apart on the same
chromosome or, as noted above, when they are on different chromosomes.
Genes for which the recombination fraction is less than 0.5 must necessarily
be on the same chromosome, and such genes are said to be linked.

To sum up, if the recombination fraction between the A and B genes is
denoted r, then the genotype $A_1B_1/A_2B_2$ produces the following types of gametes:

$A_1B_1$ with frequency (1 - r)/2

$A_2B_2$ with frequency (1 - r)/2

$A_1B_2$ with frequency r/2

$A_2B_1$ with frequency r/2

The situation in the $A_1B_2/A_2B_1$ genotype is much the same, but there is one
important difference. In this case, the $A_1B_1$ and $A_2B_2$ gametes are the
recombinant types, and the $A_1B_2$ and $A_2B_1$ gametes are the nonrecombinant types. Thus,
the genotype $A_1B_2/A_2B_1$ produces the following types of gametes:

$A_1B_1$ with frequency r/2

$A_2B_2$ with frequency r/2

$A_1B_2$ with frequency (1 - r)/2

$A_2B_1$ with frequency (1 - r)/2
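
The two lists of gamete frequencies can be expressed as one small function. The sketch below is illustrative only: the function name and the `a1_with_b1` flag are hypothetical conveniences for distinguishing the two kinds of double heterozygote. The example call uses the recombination fraction r = 0.24 from the earlier numerical illustration.

```python
# Gamete frequencies produced by a double heterozygote with recombination
# fraction r. The flag a1_with_b1 is a hypothetical name:
#   True  -> genotype A1B1/A2B2
#   False -> genotype A1B2/A2B1

def gamete_frequencies(r, a1_with_b1=True):
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie between 0 and 0.5")
    nonrecombinant = (1 - r) / 2   # each of the two parental gamete types
    recombinant = r / 2            # each of the two recombinant gamete types
    if a1_with_b1:
        return {"A1B1": nonrecombinant, "A2B2": nonrecombinant,
                "A1B2": recombinant, "A2B1": recombinant}
    return {"A1B1": recombinant, "A2B2": recombinant,
            "A1B2": nonrecombinant, "A2B1": nonrecombinant}

print(gamete_frequencies(0.24))
print(gamete_frequencies(0.24, a1_with_b1=False))
```

In either case the four frequencies sum to 1, and setting r = 0.5 gives all four gametes the frequency 0.25, as expected for genes on different chromosomes.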


PROBLEM 3.13 The genes for the human MN and Ss blood groups
discussed in Problem 3.4 are close together on the same chromosome.
Suppose that the recombination fraction between the genes is r = 0.01.
What types and frequencies of gametes would be produced by a
person of genotype MS/Ns? By a person of genotype Ms/NS?


ANSWER The MS/Ns genotype produces gametic types MS, Ns,
Ms, and NS in proportions (1 - 0.01)/2 = 0.495, (1 - 0.01)/2 = 0.495,
0.01/2 = 0.005, and 0.01/2 = 0.005, respectively. The Ms/NS genotype
produces exactly the same gametic types, but their frequencies are
0.005, 0.005, 0.495, and 0.495, respectively.


The recombination fraction between genes is important in population
genetics because it governs the rate of approach to linkage equilibrium. To be
precise, consider a population in which the actual frequencies of the
chromosome types among gametes are as follows:

$A_1B_1$:  $P_{11}$

$A_1B_2$:  $P_{12}$

$A_2B_1$:  $P_{21}$

$A_2B_2$:  $P_{22}$

where $P_{11} + P_{12} + P_{21} + P_{22} = 1$. In terms of the gametic frequencies, linkage
equilibrium is defined as the state in which $P_{11} = p_1q_1$, $P_{12} = p_1q_2$, $P_{21} = p_2q_1$,
and $P_{22} = p_2q_2$ (see Figure 3.8).

Suppose that the genes are not in linkage equilibrium. To determine how
rapidly linkage equilibrium is approached, we need to deduce the gametic
frequencies in the next generation. Consider first the $A_1B_1$ gamete. In any one
generation, a chromosome carrying $A_1B_1$ either could have undergone
recombination between the genes (an event with probability r, where r is the
recombination fraction), or could have escaped recombination between the genes
(an event with probability 1 - r). Among the $A_1B_1$ chromosomes that did not
undergo recombination, the frequency of $A_1B_1$ is the same as it was in the
previous generation; among the chromosomes that did undergo recombination,
the frequency of $A_1B_1$ chromosomes is simply the frequency of $A_1$ -/- $B_1$
genotypes in the previous generation, where the dash in place of the A and B allele
means that the identity of that particular allele is irrelevant. Because mating is
random, the overall frequency of $A_1$ -/- $B_1$ genotypes is $p_1q_1$. Putting all the
steps in the argument together, the frequency of $A_1B_1$ in any generation, call it
$P_{11}'$, is related to the frequency $P_{11}$ in the previous generation by the equation

$P_{11}' = (1 - r) \times P_{11}$   [for the nonrecombinants]

$\qquad\; + \; r \times p_1q_1$   [for the recombinants]

Subtraction of $p_1q_1$ from both sides leads to

$$P_{11}' - p_1q_1 = (1 - r)(P_{11} - p_1q_1) \tag{3.7}$$

Equation 3.7 becomes simplified somewhat by defining D as the
difference $P_{11} - p_1q_1$. Then $D_n$ is the value of D in the nth generation, and Equation
3.7 implies that $D_n = (1 - r)D_{n-1}$. The solution of this equation is found by
successive substitution as

$$D_n = (1 - r)D_{n-1} = (1 - r)^2 D_{n-2} = \cdots = (1 - r)^n D_0 \tag{3.8}$$

where $D_0$ is the value of D in the founding population. Because 1 - r < 1
when r > 0, $(1 - r)^n$ goes to zero as n becomes large, but how rapidly $(1 - r)^n$
goes to zero depends on r; the closer r is to zero, the slower the rate. This
principle is illustrated in Figure 3.9. Recall here that r = 0.5 corresponds either
to genes far apart in the same chromosome or to genes in different chromosomes.
Because $(1 - r)^n$ goes to zero, D goes to zero, and therefore $P_{11}$ goes to $p_1q_1$
unless there are other offsetting processes. Analogous arguments hold for
gametes containing $A_1B_2$, $A_2B_1$, or $A_2B_2$, and so $P_{12}$, $P_{21}$, and $P_{22}$ go to $p_1q_2$, $p_2q_1$,
and $p_2q_2$, respectively. Thus, linkage equilibrium is attained at a rate
determined by the value of r.
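
To get a feel for how strongly the rate depends on r, Equation 3.8 can be evaluated numerically. The sketch below is illustrative: the function names are hypothetical, the starting value 0.25 is the maximum D possible when all allele frequencies equal 1/2, and the half-life calculation assumes 0 < r < 1.

```python
import math

# Decay of linkage disequilibrium under random mating (Equation 3.8).
# Function names are illustrative; 0 < r < 1 is assumed throughout.

def disequilibrium_after(d0, r, n):
    """D after n generations: D_n = (1 - r)**n * D_0."""
    return (1 - r) ** n * d0

def generations_to_reach(fraction, r):
    """Smallest whole number n for which (1 - r)**n <= fraction."""
    return math.ceil(math.log(fraction) / math.log(1 - r))

for r in (0.5, 0.1, 0.01, 0.001):
    d10 = disequilibrium_after(0.25, r, 10)
    half_life = generations_to_reach(0.5, r)
    print(r, d10, half_life)
```

With free recombination D is halved every generation, whereas with r = 0.01 roughly 69 generations are needed for D to fall to half its starting value.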


Figure 3.9 Linkage disequilibrium between genes gradually disappears when
mating is random, provided there is no countervailing force building it up. The
rate of approach to linkage equilibrium depends on the recombination
frequency between the genes. The disappearance of linkage disequilibrium is gradual
even with free recombination (r = 1/2). In these examples, the frequencies of both
alleles at both loci equal 1/2, and the initial linkage disequilibrium is either at its
maximum (D = 0.25) or minimum (D = -0.25) value, given these allele
frequencies.

The value of D that holds for $P_{11} - p_1q_1$ also holds for the other possible
gametes, as follows:

$P_{11} = p_1q_1 + D$

$P_{12} = p_1q_2 - D$

$P_{21} = p_2q_1 - D$

$P_{22} = p_2q_2 + D$

The quantity D is often called the linkage disequilibrium parameter. In
terms of the gametic frequencies, D can be shown to satisfy

$$D = P_{11}P_{22} - P_{12}P_{21} \tag{3.9}$$
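
One way to verify Equation 3.9 is to substitute the four expressions for the gametic frequencies in terms of D and the allele frequencies, and then use $p_1 + p_2 = 1$ and $q_1 + q_2 = 1$. The steps may be sketched as follows:

```latex
\begin{align*}
P_{11}P_{22} - P_{12}P_{21}
  &= (p_1q_1 + D)(p_2q_2 + D) - (p_1q_2 - D)(p_2q_1 - D) \\
  &= p_1q_1p_2q_2 + D(p_1q_1 + p_2q_2) + D^2
     - p_1q_2p_2q_1 + D(p_1q_2 + p_2q_1) - D^2 \\
  &= D(p_1q_1 + p_1q_2 + p_2q_1 + p_2q_2) \\
  &= D(p_1 + p_2)(q_1 + q_2) = D.
\end{align*}
```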


With random mating and no countervailing forces, the value of D changes
according to Equation 3.8, and D = 0 corresponds to linkage equilibrium.
Furthermore, $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ must all be nonnegative and so, for any
prescribed allele frequencies $p_1$, $p_2$, $q_1$, and $q_2$, the smallest possible ($D_{min}$) and
largest possible ($D_{max}$) values of D are as follows:

$D_{min}$ = the larger of $-p_1q_1$ and $-p_2q_2$
$D_{max}$ = the smaller of $p_1q_2$ and $p_2q_1$          (3.10)


In studies of linkage disequilibrium, estimation of the gametic
frequencies $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ usually requires complex statistical procedures rather
than straightforward chromosome-counting methods because there are 10
genotypes but usually no more than nine phenotypes. (There are 10
genotypes because $A_1B_1/A_2B_2$ and $A_1B_2/A_2B_1$ must be distinguished.)

An example of linkage disequilibrium is found in the genes controlling
the MN and Ss blood groups in human populations. Earlier in this chapter,
we cited data from 1000 Britishers with respect to the MN blood groups and
showed that the genotypes MM, MN, and NN are in Hardy-Weinberg
proportions. In Problem 3.4, data from the same 1000 people were analyzed with
respect to the Ss blood groups, and genotypes SS, Ss, and ss were also found
to satisfy the Hardy-Weinberg proportions. In order to discuss linkage
disequilibrium between the genes, it will be convenient to use the symbols $p_1$ and
$p_2$ for the allele frequencies of M and N, respectively, and the symbols $q_1$ and
$q_2$ for the allele frequencies of S and s, respectively. The earlier analyses
yielded estimates of $p_1$ = 0.5425 and $p_2$ = 0.4575 for M and N and $q_1$ = 0.3080 and
$q_2$ = 0.6920 for S and s. Were the loci in linkage equilibrium, the gametic
frequencies would be $p_1q_1$ for MS, $p_1q_2$ for Ms, $p_2q_1$ for NS, and $p_2q_2$ for Ns.
Therefore, among the 1000 genotypes (a total of 2000 chromosomes), the expected
numbers are as shown in the third column below (the second column gives
the observed numbers):

Gamete    Observed    Expected
MS        474         0.5425 × 0.3080 × 2000 = 334.2
Ms        611         0.5425 × 0.6920 × 2000 = 750.8
NS        142         0.4575 × 0.3080 × 2000 = 281.8
Ns        773         0.4575 × 0.6920 × 2000 = 633.2

The $\chi^2$ for goodness of fit is 184.7 with one degree of freedom: 4 (to start
with) - 1 - 1 (for estimating $p_1$ from the data) - 1 (for estimating $q_1$ from the
data) = 1. The associated probability is so small as to be off the chart in Figure
3.3, and consequently it is very much less than 0.0001. This result means that
chance alone would produce a fit as poor or poorer substantially less than
one time in 10,000, and so the hypothesis that the loci are in linkage
equilibrium can confidently be rejected.

To quantify the amount of linkage disequilibrium, we must estimate the
gametic frequencies $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$:

MS:  $P_{11}$ = 474/2000 = 0.2370
Ms:  $P_{12}$ = 611/2000 = 0.3055
NS:  $P_{21}$ = 142/2000 = 0.0710
Ns:  $P_{22}$ = 773/2000 = 0.3865

Thus, D can be estimated as $D = P_{11}P_{22} - P_{12}P_{21}$ = 0.07. From Equation
3.10, $D_{max}$ is given by $p_1q_2$ or $p_2q_1$, whichever is smaller; in this case, $p_1q_2$ = 0.38
and $p_2q_1$ = 0.14, hence $D_{max}$ = 0.14. Therefore, $D/D_{max}$ = 0.07/0.14 = 50%,
and so we conclude that the amount of disequilibrium between the genes
controlling the MN and Ss blood groups is about 50% of its theoretical
maximum. In most local populations of sexual organisms that regularly avoid
extreme inbreeding (mating between relatives), values of D are typically zero
or close to zero (indicating linkage equilibrium) unless the genes are very
closely linked. This overall conclusion is exemplified in the following
problems.
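
The MN and Ss calculation can be gathered into one routine that works from the four observed gamete counts. This is a sketch under stated assumptions: the function and dictionary key names are hypothetical, both genes are assumed to be polymorphic (so that no expected number or bound is zero), and D is expressed relative to $D_{max}$ when positive and relative to $D_{min}$ when negative.

```python
# Linkage disequilibrium summary from four observed gamete counts.
# Function and key names are illustrative, not taken from the text.

def ld_summary(n11, n12, n21, n22):
    """Counts are for gametes A1B1, A1B2, A2B1, A2B2, in that order."""
    n = n11 + n12 + n21 + n22
    P11, P12, P21, P22 = n11 / n, n12 / n, n21 / n, n22 / n
    p1, q1 = P11 + P12, P11 + P21          # allele frequencies of A1 and B1
    p2, q2 = 1 - p1, 1 - q1
    D = P11 * P22 - P12 * P21              # Equation 3.9
    d_min = max(-p1 * q1, -p2 * q2)        # Equation 3.10
    d_max = min(p1 * q2, p2 * q1)
    observed = [n11, n12, n21, n22]
    expected = [p1 * q1 * n, p1 * q2 * n, p2 * q1 * n, p2 * q2 * n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Express D relative to the bound on its own side of zero.
    relative = D / d_max if D >= 0 else D / d_min
    return {"D": D, "D_min": d_min, "D_max": d_max,
            "relative": relative, "chi2": chi2}

mn_ss = ld_summary(474, 611, 142, 773)     # MS, Ms, NS, Ns
print(mn_ss)
```

For the blood group data this gives D of about 0.07 and a relative value of about 50%, in line with the text. The $\chi^2$ comes out marginally above 184.7 because the expected numbers are not rounded before use.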


PROBLEM 3.14 In Drosophila melanogaster, the genes E6-EC-Odh
are linked in chromosome 3. The E6 and EC genes are rather loosely
linked (r = 0.122), whereas EC and Odh are tightly linked (r = 0.002).
The recombination fractions are those in females, as recombination
does not take place in males of this species. Using the data from the
experimental population given in Problem 2.3 (page 47), carry out an
analysis to determine whether there is linkage disequilibrium
between E6 and EC. If there is linkage disequilibrium, what is its
magnitude relative to the theoretical maximum (or minimum) value?


ANSWER For the data given in Problem 2.3, the observed numbers
of the four chromosomal types $E6^F\,EC^F$, $E6^F\,EC^S$, $E6^S\,EC^F$, and $E6^S\,EC^S$
were 159, 16, 277, and 37, respectively. The estimated allele
frequencies of $E6^F$, $E6^S$, $EC^F$, and $EC^S$ are 0.3579, 0.6421, 0.8916, and 0.1084,
respectively. Assuming linkage equilibrium, the expected numbers of
the four chromosomal types are 156.0, 19.0, 280.0, and 34.0,
respectively. The $\chi^2$ value with one degree of freedom is 0.828, for which the
associated probability is about 0.4. Thus, there is no reason to reject
the hypothesis that E6 and EC are in linkage equilibrium in this
experimental population.


PROBLEM 3.15 Carry out an analysis of linkage disequilibrium for
the genes EC and Odh, using the data in Problem 2.3 (page 47). A
convenient shortcut to obtaining the $\chi^2$ value is first to calculate

$$\rho = D/(p_1p_2q_1q_2)^{1/2}$$

by substituting, on the right-hand side, the estimated values for each
of the parameters. The value of $\chi^2$ is numerically equal to $\rho^2N$, where
N is the total number of chromosomes examined. The biological
meaning of $\rho$ is that it is the correlation between alleles present in the
same chromosome.


ANSWER For the data given in Problem 2.3, the observed numbers
of the chromosomal types $EC^F\,Odh^F$, $EC^F\,Odh^S$, $EC^S\,Odh^F$, and $EC^S\,Odh^S$
were 416, 20, 44, and 9, respectively. The estimated allele
frequencies of $EC^F$, $EC^S$, $Odh^F$, and $Odh^S$ are 0.8916, 0.1084, 0.9407, and
0.0593, respectively, and D = (416 × 9 - 20 × 44)/$489^2$ = 0.0120. Thus,
$\rho$ = 0.0120/(0.8916 × 0.1084 × 0.9407 × 0.0593)$^{1/2}$ = 0.1631.
Consequently, $\chi^2$ = $0.1631^2$ × 489 = 13.0 with one degree of freedom, for
which the associated probability is 0.0004. Thus, there is significant
linkage disequilibrium between these genes. The value of $D_{max}$ is the
smaller of 0.053 and 0.102, and so $D_{max}$ = 0.053. The magnitude of the
linkage disequilibrium, relative to its theoretical maximum, is
0.012/0.053 = 22.6%. The $\chi^2$ can also be calculated from the expected
numbers of the four gametic types, which are 410.1, 25.9, 49.9, and
3.1, respectively.


PROBLEM 3.16 Use the formula for $\chi^2$ in Problem 3.15 to evaluate
the statistical significance of the linkage disequilibrium between
alleles of the gene for alcohol dehydrogenase in Drosophila melanogaster
and the presence or absence of an EcoRI restriction site located 3500
nucleotides downstream. The data are from a population descended
from animals trapped at a Dutch fruit market in Groningen (Cross
and Birley 1986).

$Adh^F$ EcoRI site present: 22

$Adh^F$ EcoRI site absent: 3

$Adh^S$ EcoRI site present: 4

$Adh^S$ EcoRI site absent: 5


ANSWER D = 0.085 and $\chi^2 = \rho^2N$ = $0.453^2$ × 34 = 7.0 with one degree
of freedom; the associated probability value is approximately 0.01.
The linkage disequilibrium is statistically significant and has a value
of 49% of its maximum possible value.


Linkage disequilibrium in local populations, such as seen in the preceding
examples, can be caused by linkage disequilibrium in the founding
population that has not yet had time to dissipate due to the small value of r.
Another possible cause of linkage disequilibrium is admixture of
populations with differing gametic frequencies. A third possibility is natural
selection differentially favoring some genotypes over others to such an extent that
it overcomes the natural tendency for D to go to zero.

Several examples in which linkage disequilibrium typically is present in
natural populations should be mentioned here. One case concerns plants that
ordinarily undergo self-fertilization, and examples are discussed in Chapter
4 in connection with the discussion of inbreeding. Another case involves
certain inversions that are polymorphic in populations of certain species of
Drosophila, most notably D. pseudoobscura and D. subobscura and their
relatives. A chromosome with an inversion, as the name implies, has a certain
segment of its genes in reverse of the normal order. Because of the inverted
segment, the process of chromosome breakage and reunion in meiosis cannot
be completed in the normal manner, with the result that the alleles in the
inverted segment are usually unaffected by recombination and so they
remain linked together. Because inversions prevent recombination, each
inversion represents a sort of "supergene," and natural selection accumulates
beneficially interacting alleles within each inversion. The beneficially
interacting alleles are said to show genetic coadaptation.

Linkage disequilibrium can also arise as an artifact of admixture of
subpopulations that differ in allele frequencies. Organisms that are subdivided
into local populations are said to have population substructure. An example
of linkage disequilibrium arising from subpopulation admixture is
illustrated in Table 3.2. In this example, subpopulation 1 and subpopulation 2 are
both in linkage equilibrium for the alleles of the A and B genes.
Subpopulation 1 has an allele frequency of 0.05 for both $A_1$ and $B_1$, and subpopulation 2
has an allele frequency of 0.95 for both $A_1$ and $B_1$. An equal mixture of
organisms from both subpopulations has the gametic frequencies shown in the last
column of Table 3.2. The allele frequencies of $A_1$ and $B_1$ are both 0.50 in the
mixture, but there is substantial linkage disequilibrium between the alleles,
as shown at the bottom of the table. In the mixed population, D equals 81% of
its theoretical maximum value. The sole cause of the disequilibrium is the
differing allele frequencies in the subpopulations. Furthermore, the
considerations in Table 3.2 make no assumption that A and B are on the same
chromosome, hence linkage disequilibrium may result from population admixture
even for genes on different chromosomes. If subpopulations become
permanently mixed and undergo random mating, then Equation 3.8 implies that
the induced linkage disequilibrium is expected to decrease at the rate r per
generation (that is, D is multiplied by 1 - r in each generation), where r is the
recombination fraction between the A and B genes. For unlinked genes, r = 1/2.


TABLE 3.2

LINKAGE DISEQUILIBRIUM FROM ADMIXTURE
OF SUBPOPULATIONS

Chromosome   Frequency   Subpopulation 1   Subpopulation 2   Equal mixture
$A_1B_1$     $P_{11}$    0.0025            0.9025            0.4525
$A_1B_2$     $P_{12}$    0.0475            0.0475            0.0475
$A_2B_1$     $P_{21}$    0.0475            0.0475            0.0475
$A_2B_2$     $P_{22}$    0.9025            0.0025            0.4525

$D = P_{11}P_{22} - P_{12}P_{21}$   0                 0                 0.2025
$D_{min}$                -0.0025           -0.0025           -0.2500
$D_{max}$                0.0475            0.0475            0.2500
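
The admixture effect in Table 3.2 can be reproduced numerically. The sketch below is illustrative (all function names are hypothetical): it builds each subpopulation in linkage equilibrium from its allele frequencies, forms the equal mixture, and evaluates D with Equation 3.9.

```python
# Linkage disequilibrium created by mixing two subpopulations that are each
# in linkage equilibrium (Table 3.2). Function names are illustrative.

def equilibrium_gametes(p1, q1):
    """Gametic frequencies (P11, P12, P21, P22) under linkage equilibrium."""
    p2, q2 = 1 - p1, 1 - q1
    return (p1 * q1, p1 * q2, p2 * q1, p2 * q2)

def mix(pop_a, pop_b, weight_a=0.5):
    """Gametic frequencies in a mixture with proportion weight_a from pop_a."""
    return tuple(weight_a * a + (1 - weight_a) * b for a, b in zip(pop_a, pop_b))

def disequilibrium(gametes):
    """D = P11*P22 - P12*P21 (Equation 3.9)."""
    P11, P12, P21, P22 = gametes
    return P11 * P22 - P12 * P21

sub1 = equilibrium_gametes(0.05, 0.05)
sub2 = equilibrium_gametes(0.95, 0.95)
mixture = mix(sub1, sub2)

print(disequilibrium(sub1), disequilibrium(sub2), disequilibrium(mixture))
```

D is zero (to within floating-point rounding) in each subpopulation but about 0.2025 in the mixture, which is 81% of the maximum of 0.25 possible when all allele frequencies equal 1/2.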

SUMMARY 

In any population, the genotype frequencies among zygotes are determined
in large part by the patterns in which genotypes of the previous generation
come together to form mating pairs. In random mating, genotypes form
mating pairs in the proportions expected from random collisions. For a gene with
two alleles A and a in a random-mating population, the expected
genotype frequencies of AA, Aa, and aa are given by $p^2$, $2pq$, and $q^2$, respectively,
where p and q are the allele frequencies of A and a, respectively, with p + q = 1.
The expected genotype frequencies with random mating constitute the
Hardy-Weinberg equilibrium (HWE). The rate at which the HWE frequencies
are attained depends on the life history of the organism. In an organism with
nonoverlapping generations, such as an annual plant, each generation is
separated in time from the preceding and the following generation; in this case,
the Hardy-Weinberg frequencies are attained in one generation of random
mating provided that the allele frequencies are equal in the sexes. In an
organism with overlapping generations, the approach to HWE is gradual.
Statistical tests of HWE are often based on the $\chi^2$ test, but this test is
relatively weak in detecting departures from the expected frequencies, especially
those caused by admixture of subpopulations differing in allele frequency.

One  of  the  principal  implications  of  the  HWE  is  that  the  allele  frequencies 
and  the  genotype  frequencies  remain  constant  from  generation  to  generation, 
hence  genetic  variation  is  maintained.  Another  major  implication  is  that, 
when  an  allele  is  rare,  the  population  contains  many  more  heterozygotes  for 
the  allele  than  it  contains  homozygotes  for  the  allele. 

Extensions of the HWE include multiple alleles and X-linked genes. With
multiple alleles, the expected frequency of a homozygous genotype $A_iA_i$
equals $p_i^2$, and the expected frequency of a heterozygous genotype $A_iA_j$ equals
$2p_ip_j$, where $p_i$ and $p_j$ are the allele frequencies of $A_i$ and $A_j$. With X-linked
alleles, the genotype frequencies in females (XX) are given by the HWE but those
in males (XY) are given by the allele frequencies. Consequently, for a recessive
X-linked mutation with allele frequency q, the proportion of affected males (q)
always exceeds the proportion of affected females ($q^2$); the rarer the recessive
allele, the greater is the excess of affected males.

Nonrandom association between the alleles of different genes is measured
by the linkage disequilibrium parameter D. Random association between
alleles of different genes is called linkage equilibrium, and it is indicated by
D = 0. When $D \neq 0$, the alleles are said to be in linkage disequilibrium.
Ordinarily, unless there is some countervailing process that maintains linkage
disequilibrium between two genes, D is expected to go to zero at a rate
determined by the recombination fraction between the genes. For unlinked genes,
D decreases by one-half in each generation; for genes that recombine with a
frequency r, D decreases by the fraction r in each generation. Significant
linkage disequilibrium is usually found in natural populations for genes that are
tightly linked, for genes that are within or near an inverted segment of
chromosome, or for genes in plant species that regularly undergo self-fertilization.
Significant linkage disequilibrium can also result from admixture of two or
more subpopulations differing in allele frequencies.

PROBLEMS 

1. Phenylketonuria is an autosomal recessive form of severe mental
retardation. About one in 10,000 newborn Caucasians are affected. Assuming
random mating, what is the frequency of heterozygous carriers?

2. Mourant et al. (1976) cite data on 400 Basques from Spain, of whom 230
were Rh+ and 170 were Rh-. Estimate the allele frequencies of D and d.
How many of the Rh+ individuals are expected to be heterozygous?

3. Kelus (cited in Mourant et al. 1976) reports a study of 3100 Poles, of
whom 1101 were MM, 1496 were MN, and 503 were NN. Calculate the
allele frequencies and the expected numbers of the three genotypes and
carry out a $\chi^2$ test for goodness of fit to random-mating proportions.

4. Consider an autosomal gene with four alleles $A_1$, $A_2$, $A_3$, and $A_4$ with
respective frequencies 0.1, 0.2, 0.3, and 0.4. Calculate the expected
genotype frequencies under random mating.

5. Show that the proportion of heterozygous offspring from a heterozygous
parent is 1/2 in a population undergoing random mating for a single gene
with two alleles.

6.  If  random  mating  with  two  alleles  gives  frequencies  D,  H,  and  R for 
homozygous  dominant,  heterozygote,  and  homozygous  recessive,  show 
that DR = H²/4.

7. When mating is random for a gene with two alleles A and a at frequencies
p and q, show that the genotype frequencies of AA, Aa, and aa are
approximately 1 - 2q, 2q, and 0 when q is so small that q² is approximately
0.

8.  In  a population  undergoing  random  mating  for  a single  gene  with  a dom- 
inant and  recessive  allele,  show  that  the  allele  frequency  of  the  recessive 
allele  among  individuals  with  the  dominant  phenotype  is  q/(  1 + q), 
where  q is  the  allele  frequency  of  the  recessive  in  the  whole  population. 

9.  The  frequency  of  one  form  of  recessive  X-linked  color  blindness  is  5% 
among  European  males.  What  is  the  expected  frequency  of  this  form  of 
color  blindness  among  females?  What  fraction  of  females  would  be  het- 
erozygous carriers? 

10.  For  a trait  due  to  a rare  X-linked  recessive  gene,  show  that  the  frequency 
of  carrier  females  is  approximately  equal  to  two  times  the  frequency  of 
affected  males. 

11. What is the analogue of the Hardy-Weinberg principle for a gene with
two  alleles  in  a tetraploid? 

12. Given the following table of allele frequencies:

                Gene
                1        2        3        4        5
    Allele 1    0.63     0.94     0.995    1.0      0.78
    Allele 2    0.37     0.06     0.005    -        0.12
    Allele 3    -        -        -        -        0.06
    Allele 4    -        -        -        -        0.04

What  is  the  proportion  (P)  of  polymorphic  genes  (using  the  definition  in 
the  text)?  Assuming  random  mating  and  linkage  equilibrium,  what  is  the 
average  heterozygosity  (H)  for  the  set  of  genes? 

13.  Charles  Darwin  could  have  discovered  segregation  had  he  known  what 
to  look  for,  as  Mendelian  segregation  occurred  in  at  least  one  of  his  own 
experiments. Darwin (cited in Iltis 1932) studied flower shape in the snapdragon
Antirrhinum. In a cross between a true-breeding strain with regular
(peloric) flowers and a true-breeding strain with irregular (normal)
flowers, all of the F1's were normal. Crosses of F1 × F1 yielded 88 normal
and 37 peloric plants. Perform a χ² test assuming a 3 : 1 ratio in the F2. Is
the  peloric  or  normal  allele  dominant? 

14.  For  a mating  between  triple  dominant/ recessive  heterozygotes  of  three 
unlinked  genes,  there  are  eight  phenotypic  classes  among  the  offspring. 
What  are  the  expected  phenotypic  ratios?  Mendel  carried  out  such  an 
experiment  and  obtained  the  phenotypic  ratio  269  : 98  : 86  : 88  : 30  : 34  : 
27  : 7 among  a total  of  639  progeny.  (He  complained  that  this  experiment 
required the most time and effort of any of his crosses.) Calculate the χ²
and associated probability.

15. If one gene has alleles A1 and A2 at frequencies p1 and p2, and another
gene has alleles B1, B2, and B3 at frequencies q1, q2, and q3, what are the
expected frequencies of gametes with linkage equilibrium assuming that
p1 = 0.3, q1 = 0.2, and q2 = 0.3?

16. For two genes with alleles A1 and A2 and B1 and B2, respectively, with p1
and p2 the allele frequencies of A1 and A2, and q1 and q2 those of B1 and B2,
let p1 = 0.7 and q1 = 0.3.

a.  What  are  the  frequencies  of  all  possible  gametes  assuming  linkage 
equilibrium? 

b.  What  are  the  frequencies  of  all  possible  gametes  if  there  is  linkage  dis- 
equilibrium with  D equal  to  50%  of  its  theoretical  maximum? 

17.  Use  the  result  in  Problem  8 to  show  that  the  frequency  of  homozygous 
recessive genotypes from dominant × dominant matings is [q/(1 + q)]²
and from dominant × recessive matings is q/(1 + q). Note that the latter is
equal  to  the  square  root  of  the  former.  (These  proportions  are  called 
Snyder's  ratios  and  were  once  used  to  test  traits  for  simple  recessive 
inheritance.) 


CHAPTER  4 


Population  Substructure 


Hierarchical Structure • F Statistics • Wahlund Effect •
DNA Typing • Assortative Mating • Inbreeding •
Inbreeding Coefficient


Population substructure is almost universal among organisms.
Many  organisms  naturally  form  subpopulations  in  the  form  of 
herds,  flocks,  schools,  colonies,  or  other  types  of  aggregations.  In 
addition,  natural  habitats  are  typically  patchy,  with  favorable  areas  inter- 
mixed with  unfavorable  areas.  Through  time,  even  uniformly  favorable  areas 
can  be  disrupted  by  floods,  fires,  or  other  perils.  When  there  is  population 
subdivision,  there  is  almost  inevitably  some  genetic  differentiation  among 
the  subpopulations.  By  genetic  differentiation  we  mean  the  acquisition  of 
allele frequencies that differ among the subpopulations. Genetic differentiation
may result from natural selection favoring different genotypes in different
subpopulations, but it may also result from random processes in the
transmission  of  alleles  from  one  generation  to  the  next  or  from  chance  differ- 
ences in  allele  frequency  among  the  initial  founders  of  the  subpopulations. 
This  chapter  considers  some  of  the  consequences  of  population  subdivision 
as  well  as  other  types  of  nonrandom  mating. 


HIERARCHICAL  POPULATION  STRUCTURE 

A population  is  said  to  have  a hierarchical  population  structure  if  the  sub- 
populations can  be  grouped  into  progressively  inclusive  levels  in  which,  at 
each  grouping,  the  next  lower  levels  are  included  ("nested")  within  the  next 
higher ones. To consider a concrete example, imagine we were interested in
the population structure of a widespread species of freshwater fish. The lowest
population level consists of a local interbreeding population of animals
within  a stream.  A stream  may  contain  more  than  one  such  local  population. 
The  next-higher  level  in  the  hierarchy  may  be  the  organization  of  streams 
into  groups  feeding  the  same  river.  Another  higher  level  may  be  rivers  with- 
in watersheds.  An  even  higher  level  of  organization  may  be  watersheds 
within  continents.  The  aggregation  of  subpopulations  into  progressively 
more  inclusive  groups  may  continue  for  as  many  levels  as  is  convenient  and 
informative.  It  is  inevitably  somewhat  arbitrary  how  the  groups  at  each  level 
are  combined  to  form  the  next  higher  level  in  the  hierarchy.  The  objective  of 
the  classification  is  informativeness:  one  tries  to  group  the  subpopulations  in 
such  a way  as  to  highlight  the  genetic  similarities  and  differences  among 
them.  If  there  were  so  much  migration  of  fish  among  subpopulations  that  all 
members  of  the  species  constituted  essentially  a single,  random-mating  pop- 
ulation, then  there  would  be  no  need  to  define  a hierarchical  population 
structure  because  it  would  be  uninformative.  However,  most  organisms  do 
have  significant  population  substructure. 

Reduction  in  Heterozygosity 

One  of  the  important  consequences  of  population  substructure  is  a reduction 
in  the  average  proportion  of  heterozygous  genotypes  relative  to  that  expect- 
ed under  random  mating.  The  reason  for  the  reduction  in  heterozygosity 
may  be  understood  by  considering  the  hypothetical  example  in  Figure  4.1. 
The  outline  is  the  floor  plan  of  a large  barn.  The  organisms  of  interest  are  the 
mice  concentrated  primarily  into  two  subpopulations  of  equal  size  at  the 
west  and  east  ends  of  the  barn.  The  movement  of  mice  between  the  subpop- 
ulations is  prevented  by  a large  population  of  hungry  and  vigilant  cats  in  the 
central  area.  The  occasional  mouse  that  comes  out  of  its  refuge  is  quickly 
eaten.  (These  hypothetical  mice  have  not  been  endowed  with  the  ingenuity 
to  find  alternative  routes  between  the  west  and  east  ends  of  the  barn,  like 
sneaking  along  the  rafters.)  Because  of  chance  effects  in  the  founding  of  the 
subpopulations,  the  west  and  east  subpopulations  are  completely  homozy- 
gous for  alternative  alleles  of  a gene.  All  the  mice  in  the  west  subpopulation 
are  AA,  and  all  those  in  the  east  subpopulation  are  aa.  In  technical  terms,  the 
west  subpopulation  is  fixed  for  the  A allele  (its  allele  frequency  equals  1), 
and  the  east  subpopulation  is  fixed  for  the  a allele.  The  genotype  frequencies 
of  A A,  Aa,  and  aa  in  the  west  subpopulation  are  1,  0,  and  0,  respectively,  and 
those  in  the  east  subpopulation  are  0,  0,  and  1,  respectively.  Within  each  sub- 
population there  is  random  mating,  and  the  genotype  frequencies,  though 
extreme,  still  satisfy  the  Hardy-Weinberg  principle.  In  particular,  the 
frequencies of AA, Aa, and aa within each subpopulation are given by p²,
2pq, and q², where p = 0 in the east subpopulation, and p = 1 in the west
subpopulation. Therefore, within any one of the subpopulations in Figure 4.1,
the frequency of heterozygotes equals the frequency expected with HWE.

Figure 4.1 An extreme example of the general principle that a difference in
allele frequency among subpopulations results in a deficiency of heterozygotes.
The floor plan is that of a hypothetical barn. The mouse subpopulations in the
east and west enclaves are completely isolated owing to the cats in the middle.
The west subpopulation is fixed for the A allele and the east subpopulation for
the a allele. Trapping mice at random in the area patrolled by the cats would
yield an overall allele frequency of 1/2 but no heterozygotes.

The  situation  regarding  the  total  population  in  Figure  4.1  is  very  different, 
however,  as  there  is  an  overall  deficiency  of  heterozygotes.  By  "total  popula- 
tion" in  this  context,  we  mean  the  aggregate  of  all  mice  without  regard  to  the 
population  substructure.  Suppose  we  were  unaware  of  the  population  sub- 
structure in  the  barn.  We  might  then  suppose  that  the  barn  contained  a single 
randomly  mating  population.  To  study  the  total  population  of  the  barn,  we 
trap  mice  at  random  in  the  center  area,  catching  the  occasional  escapee  from 
the  cats.  Because  the  subpopulations  are  fixed  for  either  A or  a,  half  the  time 
we would trap an AA homozygote and half the time an aa homozygote.
Consequently, we estimate the allele frequency of A as p = 1/2. Assuming random
mating and Hardy-Weinberg genotype frequencies in the total population,
the expected genotype frequencies of AA, Aa, and aa are given by the HWE
as p², 2pq, and q². Because the overall allele frequency of A among the
trapped animals is 1/2, we would naively expect a fraction 2 × 1/2 × 1/2 = 1/2 of
the animals to be heterozygous. In fact, we would have caught no heterozygotes
at all!

This rather paradoxical result (that there is a deficiency of heterozygotes
in the total population even though random mating takes place within each
subpopulation) is a consequence of the difference in allele frequency among
the  subpopulations.  Were  the  allele  frequencies  in  both  subpopulations  the 
same,  it  would  not  matter  whether  we  sampled  from  the  west  subpopulation, 
the east subpopulation, or from the area in between. We would recover genotypes
in Hardy-Weinberg proportions because both subpopulations are genotypically
identical and in HWE. In an organism with hierarchically
structured  subpopulations,  there  is  an  analogous  deficiency  of  heterozygotes 
at  each  level  in  the  hierarchy.  The  following  section  examines  the  heterozy- 
gosities in  more  detail. 
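
The barn example can be restated as a short calculation. The sketch below is only an illustration, assuming equal-sized subpopulations that are each in HWE; the helper names `hwe_heterozygosity` and `heterozygote_deficit` are hypothetical and do not come from the text.

```python
def hwe_heterozygosity(p):
    """Heterozygote frequency 2pq expected under Hardy-Weinberg."""
    return 2.0 * p * (1.0 - p)


def heterozygote_deficit(freqs):
    """Naively expected minus actual heterozygosity.

    `freqs` holds the frequency of allele A in each subpopulation;
    the subpopulations are assumed to be of equal size.
    """
    mean_p = sum(freqs) / len(freqs)
    within = sum(hwe_heterozygosity(p) for p in freqs) / len(freqs)
    return hwe_heterozygosity(mean_p) - within


# West subpopulation fixed for A (p = 1), east fixed for a (p = 0).
barn = heterozygote_deficit([1.0, 0.0])

# Identical allele frequencies in both subpopulations: no deficiency.
no_difference = heterozygote_deficit([0.5, 0.5])
```

In the barn, the naive expectation of 1/2 heterozygotes is missed entirely, whereas two subpopulations with the same allele frequency show no deficiency at all.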

Average  Heterozygosity 

In  the  Mohave  desert,  local  populations  of  the  annual  plant  Linanthus  parryae 
are  polymorphic  for  white  versus  blue  flowers.  The  plant  is  diminutive,  aver- 
aging just  1 cm  in  height,  and  when  the  plant  is  in  bloom,  the  ground  cover 
of  white  flowers  justifies  the  popular  name  "desert  snow."  Blue  flowers 
result  from  homozygosity  for  a recessive  allele.  The  geographical  distribu- 
tion of  the  frequency  q of  the  recessive  allele  across  a region  of  the  Mohave 
desert  is  illustrated  in  Figure  4.2.  Each  allele  frequency  is  based  on  an  exam- 
ination of  approximately  4000  plants  over  an  area  of  about  30  square  miles 
(Epling  and  Dobzhansky  1942). 

Figure 4.2 Estimated frequency of a recessive allele for blue flower color in
populations of Linanthus parryae in an area of approximately 900 square miles in
the Mohave desert. Each allele frequency is based on an examination of
approximately 4000 plants over an area of about 30 square miles. (After Wright 1943a.)

Judging from the allele-frequency map in Figure 4.2, the highest frequencies
of the blue-flower allele are largely concentrated at the west and east
ends of the region in question. The unequal allele frequencies across the
range imply a decrease in average heterozygosity, relative to HWE, analogous
to the mouse example in Figure 4.1, though not as extreme. Figure 4.2
shows the estimated allele frequency in each of 30 subpopulations. Suppose
each of the subpopulations is regarded as a random-mating unit in HWE for
the  flower-color  alleles.  The  average  heterozygosity  among  the  subpopula- 
tions can  be  denoted  as  Hs,  where  the  subscript  indicates  subpopulation.  The 
calculations  are  shown  in  the  third  column  in  Table  4.1;  the  heterozygosity  in 
each  subpopulation  is  calculated  as  2 pq,  where  p and  q are  the  estimated 
frequencies  of  the  alleles  for  white  versus  blue  flower  color,  respectively,  in 
each subpopulation. The HS tabulated at the bottom is the average of all the
subpopulation heterozygosities (counting the value 0.000 a total of nine times
because of the nine different subpopulations in which q = 0.000).

TABLE 4.1  HIERARCHICAL STRUCTURE OF LINANTHUS PARRYAE

          Subpopulations             Regions                    Total
          Allele                     Average allele             Average allele
Region    frequency  Heterozygosity  frequency  Heterozygosity  frequency  Heterozygosity

W         0.573      0.4893
          0.717      0.4058
          0.504      0.5000
          0.657      0.4507
          0.302      0.4216
          0.339      0.4482          0.5153     0.4995
C         9 × 0.000  0.0000
          0.032      0.0620
          0.007      0.0139
          0.008      0.0159
          0.005      0.0100
          0.009      0.0178
          0.005      0.0100
          0.010      0.0198
          0.068      0.1268
          0.002      0.0040
          0.004      0.0080
          0.126      0.2202          0.0138     0.0272
E         0.106      0.1895
          0.224      0.3476
          0.411      0.4842
          0.014      0.0276          0.1888     0.3062          0.1374     0.2371

Average
heterozygosity       HS = 0.1424                HR = 0.1589                HT = 0.2371

Source: Data from Wright 1943a.

A second hierarchical level of population substructure is that of region:
west (W), central (C), or east (E). To calculate the heterozygosity expected
from HWE in each region, we first estimate the average allele frequency in
the region by taking the mean allele frequency across all subpopulations in
the region. For example, the average allele frequency q in region E is (0.106 +
0.224 + 0.411 + 0.014)/4 = 0.1888. In each region, the heterozygosity expected
from HWE is calculated as 2pq, where p and q are the average allele frequencies
in the region. In region E, therefore, the regional heterozygosity equals
2 × (1 - 0.1888) × 0.1888 = 0.3062. The average heterozygosity within regions
at the bottom of column 5 is denoted HR; it is the weighted average of the
regional heterozygosities, where each regional heterozygosity is weighted by
the number of subpopulations in the region. In this example, HR = (6 × 0.4995
+ 20 × 0.0272 + 4 × 0.3062)/30 = 0.1589.


Yet another hierarchical level of population substructure in Figure 4.2 is
the total population: the aggregate population obtained by conceptually
uniting all subpopulations to form a single random mating unit. The average
allele frequency is the mean allele frequency across all subpopulations, and
q = 0.1374. Then HT is calculated as 2pq = 2 × 0.8626 × 0.1374 = 0.2371.

To  sum  up: 


• HS is the average HWE heterozygosity among organisms within random-mating
subpopulations.

• HR is the average HWE heterozygosity among organisms within regions.

• HT is the average HWE heterozygosity among organisms within the total
area.
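
The three heterozygosities can be recomputed from the allele frequencies listed in Table 4.1. The following is a minimal sketch, assuming the frequencies are transcribed correctly and that every subpopulation is weighted equally; the names `het`, `mean`, `regions`, `h_s`, `h_r`, and `h_t` are hypothetical and chosen only for this illustration.

```python
def het(q):
    """HWE heterozygosity 2pq for an allele at frequency q."""
    return 2.0 * q * (1.0 - q)


def mean(values):
    return sum(values) / len(values)


# Frequencies q of the blue-flower allele, transcribed from Table 4.1.
regions = {
    "W": [0.573, 0.717, 0.504, 0.657, 0.302, 0.339],
    "C": [0.000] * 9 + [0.032, 0.007, 0.008, 0.005, 0.009, 0.005,
                        0.010, 0.068, 0.002, 0.004, 0.126],
    "E": [0.106, 0.224, 0.411, 0.014],
}
all_q = [q for qs in regions.values() for q in qs]
n = len(all_q)  # 30 subpopulations

# H_S: average of the heterozygosities of the individual subpopulations.
h_s = mean([het(q) for q in all_q])

# H_R: regional heterozygosities, weighted by subpopulations per region.
h_r = sum(len(qs) * het(mean(qs)) for qs in regions.values()) / n

# H_T: heterozygosity at the mean allele frequency of the whole area.
h_t = het(mean(all_q))
```

Carried to four decimal places, the results should agree with the values of HS, HR, and HT given in the table, up to small rounding differences.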


The  concepts  of  hierarchical  population  structure  and  the  various  levels  of 
heterozygosity  were  originally  developed  by  Sewall  Wright  (1889-1988)  to 
quantify  genetic  differences  among  subgroups  at  the  various  levels;  he  called 
his  theory  isolation  by  distance  (Wright  1943a,  1943b).  The  motivation  for 
developing  such  a method  was  summarized  in  the  following  passage.  The 
term  panmixia  is  a synonym  for  random  mating. 

Study  of  statistical  differences  among  local  populations  is  an  important  line  of 
attack  on  the  evolutionary  problem.  While  such  differences  can  only  rarely 
represent  first  steps  toward  speciation  in  the  sense  of  the  splitting  of  the 
species,  they  are  important  for  the  evolution  of  the  species  as  a whole.  They 
provide  a possible  basis  for  intergroup  selection  of  genetic  systems,  a process 
that  provides  a more  effective  mechanism  for  adaptive  advance  of  the  species 
as a whole than does the mass selection which is all that can occur under
panmixia.

Furthermore, the reduction in heterozygosity resulting from population
substructure is intimately related to the reduction in heterozygosity caused
by inbreeding (mating between relatives), as we shall see later in this chapter.
Indeed, the relation of population substructure to inbreeding can be
understood  by  interpreting  each  subpopulation  as  a sort  of  "extended  fami- 
ly" or  set  of  interconnected  pedigrees.  Organisms  in  the  same  subpopulation 
will  often  share  one  or  more  recent  or  remote  common  ancestors,  and  so  a 
mating  between  organisms  in  the  same  subpopulation  will  often  be  a mating 
between  relatives.  The  larger  the  subpopulation,  and  the  more  recently  it  has 
been  isolated,  the  smaller  this  inbreeding  effect;  nevertheless  the  analogy  to 
inbreeding  is  valid. 


Wright's  F Statistics 

To  quantify  the  inbreeding  effect  of  population  substructure,  Wright  (1921) 
defined  what  has  come  to  be  called  the  fixation  index.  This  index  equals  the 
reduction  in  heterozygosity  expected  with  random  mating  at  any  one  level 
of  a population  hierarchy  relative  to  another,  more  inclusive  level  of  the  hier- 
archy. The  fixation  index  is  a useful  index  of  genetic  differentiation  because 
it  allows  an  objective  comparison  of  the  overall  effect  of  population  sub- 
structure among  different  organisms  without  getting  into  details  of  allele  fre- 
quencies, observed  levels  of  heterozygosity,  and  so  forth.  The  genetic  symbol 
for  a fixation  index  is  F embellished  with  subscripts  denoting  the  levels  of 
the  hierarchy  being  compared.  For  example,  FSR  is  the  fixation  index  of  the 
subpopulations  relative  to  the  regional  aggregates: 


$$F_{SR} = \frac{H_R - H_S}{H_R} \tag{4.1}$$


In words, Equation 4.1 defines FSR as the decrease of heterozygosity
among subpopulations within regions (HR - HS), relative to the heterozygosity
among regions (HR). For the Linanthus example in Table 4.1, FSR = (0.1589
- 0.1424)/0.1589 = 0.1036.

At  the  next  level  of  the  hierarchy,  we  may  define  the  fixation  index  FRT  as 
the  proportionate  reduction  in  heterozygosity  of  the  regional  aggregates  rel- 
ative to  the  total  combined  population: 


$$F_{RT} = \frac{H_T - H_R}{H_T} \tag{4.2}$$


The data in Table 4.1 show that FRT = (0.2371 - 0.1589)/0.2371 = 0.3299.
Comparison of this value with FSR above already makes it clear that there is
substantially more variation among regions (as measured by FRT) than there is
among subpopulations within regions (as measured by FSR). The comparison
of  the  fixation  indices  at  the  two  levels  gives  quantitative  expression  to  the 
regional  differences  apparent  in  Figure  4.2. 




The fixation index FST compares the least inclusive to the most inclusive
levels of the population hierarchy and measures all effects of population
substructure combined:


$$F_{ST} = \frac{H_T - H_S}{H_T} \tag{4.3}$$


From Table 4.1, FST = (0.2371 - 0.1424)/0.2371 = 0.3993. The overall reduction
in average heterozygosity is therefore close to 40% of the total
heterozygosity, a very substantial effect.
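
Equations 4.1 through 4.3 share one form, so a single helper can evaluate all three. The sketch below is illustrative only: the function name `fixation_index` is hypothetical, and the inputs are the rounded heterozygosities from Table 4.1, so the results can be expected to match the values quoted in the text only to about three decimal places.

```python
def fixation_index(h_lower, h_upper):
    """Proportionate reduction in heterozygosity between two levels.

    h_lower is the heterozygosity at the less inclusive level and
    h_upper that at the more inclusive level of the hierarchy.
    """
    if h_upper <= 0:
        raise ValueError("the more inclusive level must have heterozygosity > 0")
    return (h_upper - h_lower) / h_upper


# Rounded heterozygosities from the bottom row of Table 4.1.
H_S, H_R, H_T = 0.1424, 0.1589, 0.2371

f_sr = fixation_index(H_S, H_R)  # subpopulations relative to regions
f_rt = fixation_index(H_R, H_T)  # regions relative to the total
f_st = fixation_index(H_S, H_T)  # subpopulations relative to the total
```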

The  hierarchical  F-statistics  defined  in  Equations  4.1  through  4.3  are  all 
types  of  fixation  indices,  but  they  differ  in  the  reference  populations:  FSR  is 
concerned  with  subpopulations  (S)  relative  to  the  regional  aggregates  (R),  FRT 
is  concerned  with  the  regional  groupings  relative  to  the  total  population  (T), 
and FST is concerned with the subpopulations relative to the total population.
The  index  FST  is  the  most  inclusive  measure  of  population  substructure.  The 
mathematical  relation  between  the  three  types  of  F statistics  is  demonstrated 
in  the  following  problem. 


PROBLEM 4.1 Show that FSR, FRT, and FST are related by the equation
(1 - FSR) × (1 - FRT) = 1 - FST.


ANSWER From Equation 4.1, FSR = 1 - (HS/HR), or 1 - FSR = HS/HR.
Equation 4.2 implies that FRT = 1 - (HR/HT), or 1 - FRT = HR/HT. Finally,
Equation 4.3 implies that FST = 1 - (HS/HT), or 1 - FST = HS/HT.
Now multiply the expressions for 1 - FSR and 1 - FRT together to
obtain (1 - FSR) × (1 - FRT) = (HS/HR) × (HR/HT) = HS/HT = 1 - FST.


For examining the overall level of genetic divergence among subpopulations,
FST is the informative statistic. Although FST has a theoretical minimum
of 0 (indicating no genetic divergence) and a theoretical maximum of 1
(indicating fixation for alternative alleles in different subpopulations), the
observed maximum is usually much less than 1. Wright (1978) has suggested
the following qualitative guidelines for the interpretation of FST:

• The range 0 to 0.05 may be considered as indicating little genetic
differentiation.

• The range 0.05 to 0.15 indicates moderate genetic differentiation.

• The range 0.15 to 0.25 indicates great genetic differentiation.

• Values of FST above 0.25 indicate very great genetic differentiation.

On  the  other  hand,  Wright  also  notes  that,  among  subpopulations,  "dif- 
ferentiation is  by  no  means  negligible  if  FST  is  as  small  as  0.05  or  even  less." 
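
Wright's guidelines can be written as a small lookup, which may be convenient when many values of FST are compared. This is a sketch only: the function name `describe_fst` is hypothetical, and because the ranges quoted above share their endpoints, assigning each boundary value to the higher category is an assumption made here, not something the guidelines specify.

```python
def describe_fst(fst):
    """Map an FST value onto Wright's (1978) qualitative categories.

    Boundary values (0.05, 0.15, 0.25) are placed in the higher
    category; that choice is an assumption of this sketch.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("FST is expected to lie between 0 and 1")
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "great"
    return "very great"


linanthus = describe_fst(0.3993)  # FST from Table 4.1
```

As Wright's own caveat makes plain, these labels are rough guides rather than thresholds with biological meaning.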


PROBLEM  4.2  Some  subpopulations  of  Drosophila  melanogaster 
show  an  altitudinal  gradient  in  the  allozymes  of  alcohol  dehydroge- 
nase in  which  the  frequency  of  the  Adh-F  allele  increases  with  alti- 
tude. The  data  in  the  accompanying  table  are  estimates  of  the  allele 
frequency  of  Adh-F  in  seven  samples  of  adult  flies  captured  either  in 
the  mountains,  in  the  foothills,  or  on  the  plains  of  the  Caucasus 
Mountains  of  the  former  Soviet  Union.  Each  allele  frequency  is  based 
on  electrophoresis  of  approximately  300  adult  flies  (Grossman  et  al. 
1970). Calculate the F statistics FSE (subpopulations within elevations),
FET (elevations within the total), and FST (subpopulations relative to
the total). What do the magnitudes of the F statistics suggest regarding
genetic differentiation among subpopulations in the frequency of
Adh-F with respect to altitude?

            Allele                  Allele                  Allele
Elevation   frequency   Elevation   frequency   Elevation   frequency

Mountain    0.321       Foothill    0.131       Plain       0.082
Mountain    0.226       Foothill    0.109       Plain       0.088
                                                Plain       0.035


ANSWER Let p represent the allele frequency of Adh-F. For each
subpopulation, the HWE heterozygosity equals 2p(1 - p), which for
the seven samples are 0.4359 and 0.3498 (mountain), 0.2277 and 0.1942
(foothill), and 0.1506, 0.1605, and 0.0676 (plain). The average of these
values is HS, which equals 0.2266. At each of the elevations, the average
allele frequency is the mean across the subpopulations sampled at
that elevation. For mountain, foothill, and plain, these means equal
0.274, 0.120, and 0.068, respectively, yielding the elevation HWE
heterozygosities 0.3974, 0.2112, and 0.1273, respectively. (Your results
may differ slightly according to the number of significant digits you
carry along.) The average of the elevation heterozygosities equals the
mean elevation heterozygosity (HE), and it is the weighted average
(2 × 0.3974 + 2 × 0.2112 + 3 × 0.1273)/7 = 0.2285. Finally, the allele
frequency for the total heterozygosity is equal to the mean allele
frequency across subpopulations, which is 0.142, yielding a total HWE
heterozygosity (HT) of 0.2433. The F statistics are FSE = (HE - HS)/HE =
0.0081, FET = (HT - HE)/HT = 0.0609, and FST = (HT - HS)/HT = 0.0684.
[As a check, note that (1 - FSE) × (1 - FET) = 1 - FST.] Judging from the
magnitudes of the F statistics, it is clear that most of the differentiation
among subpopulations is correlated with altitude; there is very
little genetic differentiation among subpopulations at each elevation.
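
The arithmetic in this answer can be packaged as one routine that accepts any two-level grouping. The following is a minimal sketch of the uncorrected procedure used in the answer, assuming every subpopulation receives equal weight; the function name `hierarchical_f` and the dictionary layout are hypothetical choices made for illustration.

```python
def het(p):
    """HWE heterozygosity 2p(1 - p)."""
    return 2.0 * p * (1.0 - p)


def mean(values):
    return sum(values) / len(values)


def hierarchical_f(groups):
    """Return (F_sub_in_group, F_group_in_total, F_sub_in_total).

    `groups` maps a group label (here an elevation) to the allele
    frequencies of the subpopulations sampled in that group.
    """
    all_p = [p for ps in groups.values() for p in ps]
    n = len(all_p)
    h_s = mean([het(p) for p in all_p])
    # Group heterozygosities weighted by the number of subpopulations.
    h_g = sum(len(ps) * het(mean(ps)) for ps in groups.values()) / n
    h_t = het(mean(all_p))
    return (h_g - h_s) / h_g, (h_t - h_g) / h_t, (h_t - h_s) / h_t


# Adh-F allele frequencies from the table in Problem 4.2.
adh_f = {
    "mountain": [0.321, 0.226],
    "foothill": [0.131, 0.109],
    "plain": [0.082, 0.088, 0.035],
}
f_se, f_et, f_st = hierarchical_f(adh_f)
```

Because no intermediate value is rounded, the last digit of each statistic may differ slightly from the figures worked out by hand above.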


The  method  of  estimating  the  F statistics  by  replacing  the  parameters  in 
Equations  4.1  through  4.3  with  their  observed  or  estimated  values  is  not  nec- 
essarily the  best,  particularly  with  small  samples.  Ideally,  estimates  of  the  F 
statistics  should  correct  for  the  effects  of  sampling  a limited  number  of  sub- 
populations, as  well  as  for  the  effects  of  sampling  a limited  number  of  organ- 
isms in  each  subpopulation.  Methods  for  making  these  corrections  have  been 
suggested  but  are  quite  complex  and  raise  additional  issues.  For  an  excellent 
discussion,  see  Weir  and  Cockerham  (1984).  Important  issues  are  also 
addressed  in  Wright  (1978,  pp.  86-89),  Curie-Cohen  (1982),  Nei  and  Chesser 
(1983),  and  Nei  (1986).  We  will  use  the  uncorrected  estimation  procedure, 
which  is  adequate  for  purposes  of  illustration. 

Genetic  Divergence  among  Subpopulations 

The  fixation  index  FST  defined  in  Equation  4.3  serves  as  a convenient  and 
widely  used  measure  of  genetic  differences  among  subpopulations.  The 
identification  of  the  causes  underlying  a particular  value  of  FST  observed  in 
a natural  population  is  often  difficult.  Allele  frequencies  among  subpopula- 
tions can  become  different  because  of  random  processes  (random  genetic 
drift)  as  well  as  by  natural  selection  with  complications  from  migration 
among  the  subpopulations.  Difficulties  in  the  assignment  of  cause  do  not, 
however,  invalidate  the  usefulness  of  FST  as  an  index  of  genetic  differentia- 
tion. 

TABLE 4.2  TOTAL HETEROZYGOSITY (HT), AVERAGE HETEROZYGOSITY
AMONG SUBPOPULATIONS (HS), AND FIXATION INDEX (FST)
FOR VARIOUS ORGANISMS

                                        Number of    Number
Organism                                populations  of loci  HT     HS     FST

Human (major races)                     3            35       0.130  0.121  0.069
Human, Yanomama Indian villages         37           15       0.039  0.036  0.077
House mouse (Mus musculus)              4            40       0.097  0.086  0.113
Jumping rodent (Dipodomys ordii)        9            18       0.037  0.012  0.676
Drosophila equinoxialis                 5            27       0.201  0.179  0.109
Horseshoe crab (Limulus)                4            25       0.066  0.061  0.076
Lycopod plant (Lycopodium lucidulum)    4            13       0.071  0.051  0.282

Source: Protein electrophoretic data from Nei 1975.

The levels of genetic divergence among human subpopulations and
among subpopulations of several other species are presented in Table 4.2. The
values of FST imply that genetic divergence between human subpopulations
is quite small. Of the total genetic variation found in three major races
(Caucasoid, Negroid, and Mongoloid), only 7% (0.07) is ascribable to genetic
differences among races. About 93% of the total genetic variation is found
within  races.  Similarly,  of  the  total  genetic  variation  found  in  the  native 
Yanomama  Indians  of  Venezuela  and  Brazil,  only  7.7%  (0.077)  is  due  to  dif- 
ferences in  allele  frequency  among  villages.  This  result  implies  that  92.3%  of 
the  total  genetic  variation  is  found  within  any  single  village.  Values  of  FST 
for  other  organisms  are  quite  variable,  presumably  because  FST  is  influenced 
by  the  size  of  the  subpopulations — which  is  a major  determinant  of  the  mag- 
nitude of  random  changes  in  allele  frequency — by  the  amount  and  pattern  of 
migration  between  subpopulations,  and  by  other  factors,  including  natural 
selection. 
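
The reading of FST as the fraction of variation found among subpopulations, and of 1 - FST as the fraction found within them, can be checked against Table 4.2. The sketch below assumes the HT and HS columns are transcribed correctly; the names `table_4_2`, `partition`, and `fractions`, and the shortened organism labels, are hypothetical.

```python
# (H_T, H_S) pairs transcribed from Table 4.2.
table_4_2 = {
    "Human (major races)": (0.130, 0.121),
    "Human, Yanomama villages": (0.039, 0.036),
    "House mouse": (0.097, 0.086),
    "Jumping rodent": (0.037, 0.012),
    "Drosophila equinoxialis": (0.201, 0.179),
    "Horseshoe crab": (0.066, 0.061),
    "Lycopod plant": (0.071, 0.051),
}


def partition(h_t, h_s):
    """Split total variation into among- and within-subpopulation parts."""
    among = (h_t - h_s) / h_t  # this is FST (Equation 4.3)
    return among, 1.0 - among


fractions = {name: partition(h_t, h_s)
             for name, (h_t, h_s) in table_4_2.items()}
```

For the three major human races, for example, the pair comes out near 0.07 and 0.93, matching the 7% and 93% quoted in the text.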

Table  4.2  provokes  a brief  discussion  of  the  sensitive  term  race  because  the 
term  is  prone  to  misunderstanding  or  misuse.  In  population  genetics,  a race 
is  a group  of  organisms  in  a species  that  are  genetically  more  similar  to  each 
other  than  they  are  to  the  members  of  other  such  groups.  Populations  that 
have  undergone  some  degree  of  genetic  divergence  as  measured  by,  for 
example, FST, therefore qualify as races. Using this definition, the human
population  contains  many  races.  Each  Yanomama  village  represents,  in  a cer- 
tain sense,  a separate  "race,"  and  the  Yanomama  as  a whole  also  form  a dis- 
tinct "race."  Such  fine  distinctions  are  rarely  useful,  however.  It  is  usually 
more  convenient  to  group  populations  into  larger  units  that  still  qualify  as 
races  in  the  definition  given.  These  larger  units  often  coincide  with  races 
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based  on  physical  characteristics  such  as  skin  color,  hair  color,  hair  texture, 
facial  features,  and  body  conformation.  Contemporary  anthropologists  tend 
to  avoid  "race"  as  a descriptive  term  for  human  groups  because  cultural  and 
linguistic  differences,  which  are  also  important,  are  often  discordant  with 
genetic  differences  and  sometimes  discordant  with  each  other. 

Here it must be pointed out that the data in Table 4.2, which indicate
much more genetic variation within than among human races, may be
misleading. The conclusion is based primarily on genes determining allozymes,
and it certainly is not true for genes influencing skin color, hair color, hair
texture, and other traits that most people think of in connection with the word
"race." However, skin color and other prominent racial characteristics are
used to delineate races precisely because racial differences for these traits are
rather large, so the genes involved cannot be representative of the entire
genome. On the other hand, allozyme loci may not be very representative of
the genome either. See Nei and Roychoudhury (1982) for a review of the
genetic relationship and evolution of human races.


ISOLATE  BREAKING:  THE  WAHLUND  PRINCIPLE 

The flip side of the coin of heterozygosity is homozygosity because a diploid
organism that is not heterozygous must be homozygous. Mathematically,
homozygosity = 1 - heterozygosity. Therefore, a corollary of the deficit in
average heterozygosity, relative to HWE, that results from population
substructure is that there is an equal excess in average homozygosity. If the
population substructure is eliminated and the former subpopulations undergo
random mating, the average homozygosity decreases, and the average
heterozygosity increases by an equal amount. The phenomenon that the
average homozygosity decreases when subpopulations join together is called
isolate breaking or the Wahlund principle, after the Swedish statistician
and human geneticist Sten Gösta William Wahlund (1901-1976) who first
described the effect (Wahlund 1928).

The subpopulations of hypothetical mice in Figure 4.1 afford an
illustration of the Wahlund principle. As long as the cats keep the
subpopulations separate, the homozygosity equals 1 because the west
subpopulation is genotypically AA and the east subpopulation is genotypically
aa. If the cats were to disappear and the subpopulations of mice came together
and practiced random mating, the genotype frequencies would be 1/4 AA,
1/2 Aa, and 1/4 aa. The homozygosity in the fused population is
1/4 + 1/4 = 1/2, which is a substantial decrease from the average in the
subpopulations prior to fusion and random mating. Not only is the total
homozygosity reduced by population fusion, so is the average frequency of each
homozygous genotype. Consider aa, for example. Prior to fusion, the average
frequency of aa across both subpopulations equals 1/2; after fusion and random
mating, the frequency of aa equals 1/4.

In human population genetics, the Wahlund principle is usually cited for
its implication that fusion of subpopulations results in a decrease in the
average frequency of children born with a genetic disease resulting from
homozygosity for a rare recessive allele, particularly an allele with a
relatively high frequency in one of the subpopulations. Examples of harmful
recessive alleles at high frequency in some human subpopulations include, in
Caucasians, the alleles for $\alpha_1$-antitrypsin deficiency (q = 0.024) and
cystic fibrosis (q = 0.022); in blacks, sickle-cell anemia (q = 0.05 in American
blacks, up to q = 0.1 in some African populations); in the Hopi and some other
Southwest American Indian tribes, albinism (q = 0.07); and, in Ashkenazi Jews,
Tay-Sachs disease (q ≈ 0.013).

Figure 4.3 Illustration of the Wahlund principle. (A) Separate subpopulations.
(B) Fused subpopulations. The frequency of homozygous recessives after
population fusion and random mating is less than the average frequency before
fusion. The difference in frequency of the homozygous recessives equals the
variance in allele frequency among the subpopulations.

The Wahlund principle for a recessive allele in two subpopulations is
illustrated in Figure 4.3A. The west subpopulation has allele frequency $q_1$
and genotype frequency $q_1^2$; the east subpopulation has allele frequency
$q_2$ and genotype frequency $q_2^2$. The average frequency of the homozygous
recessive across both subpopulations equals $(q_1^2 + q_2^2)/2$. The result of
fusion of the subpopulations is shown in part B. Assuming that the
subpopulations are equal in size, the allele frequency in the combined
population is $\bar{q} = (q_1 + q_2)/2$, and the genotype frequency with HWE
equals $\bar{q}^2$. Therefore, were the subpopulations in part A to fuse and
come into HWE, the average frequency of homozygous recessives would be
reduced by an amount given by:


$$
\begin{aligned}
R_{\text{separate}} - R_{\text{fused}} &= \frac{q_1^2 + q_2^2}{2} - \bar{q}^2 \\
&= \frac{1}{2}(q_1 - \bar{q})^2 + \frac{1}{2}(q_2 - \bar{q})^2 = \sigma_q^2
\end{aligned}
\tag{4.4}
$$

In Equation 4.4, we leave it as an exercise to verify that the expressions in
$q_1$ and $q_2$ on the first and second lines are equal. The symbol
$\sigma_q^2$ is the variance in allele frequency among the original
subpopulations. Because the variance is always nonnegative, isolate breaking
always decreases the average frequency of homozygous recessives unless the
allele frequencies are equal to begin with. Furthermore, the result in
Equation 4.4 is true for any number of subpopulations of equal or unequal size;
in words:


Fusion of subpopulations with random mating and HWE decreases the
average frequency of homozygous recessives by an amount equal to the variance
in allele frequency among the original subpopulations.

To illustrate the effect of isolate breaking, imagine a subpopulation of
gray squirrels that has a high frequency of albinism equal to 16%. (Albinism
is an inherited absence of pigment resulting from a homozygous recessive
gene.) With HWE, the allele frequency of the albino mutation in this
subpopulation is $\sqrt{0.16} = 0.4$. In a nearby forest there is another
subpopulation of equal size in which the albino mutation is absent, so that the
allele frequency in this subpopulation is 0. Overall, the average frequency of
albinos in the two populations is (0.16 + 0)/2 = 8%. If the two subpopulations
fused with random mating and HWE, the allele frequency of the albino mutation
in the fused population would be (0.4 + 0)/2 = 0.2, and the frequency of the
homozygous recessive would equal $0.2^2$ = 4%. The frequency of albinos in the
fused population is substantially smaller than the average frequency in the
original subpopulations. The reduction of 4% equals the variance in allele
frequency between the two subpopulations, $(0.4 - 0)^2/4 = 0.04$, as
Equation 4.4 requires.
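
The squirrel example can be checked numerically. The sketch below assumes equal-sized subpopulations, each in HWE, and uses the population variance of the allele frequencies; the function name `wahlund` is a label chosen here for illustration.

```python
# Numerical check of the Wahlund principle (Equation 4.4) for the squirrel example.
# Assumes equal-sized subpopulations, each in Hardy-Weinberg equilibrium.
from statistics import fmean, pvariance

def wahlund(allele_freqs):
    """Return (average q^2 before fusion, q-bar^2 after fusion, variance of q)."""
    separate = fmean([q * q for q in allele_freqs])  # mean homozygote frequency
    q_bar = fmean(allele_freqs)                      # allele frequency after fusion
    fused = q_bar ** 2                               # HWE in the fused population
    variance = pvariance(allele_freqs)               # variance among subpopulations
    return separate, fused, variance

separate, fused, variance = wahlund([0.4, 0.0])
print(f"before fusion: {separate:.2%}")
print(f"after fusion:  {fused:.2%}")
print(f"reduction:     {separate - fused:.2%} (variance = {variance:.4f})")
```

Because the function accepts a list of any length, it can also be used to confirm that the reduction equals the variance for more than two equal-sized subpopulations.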


PROBLEM 4.3 Tay-Sachs disease is an autosomal-recessive degenerative
disorder of the brain that usually leads to death in infancy or
early childhood. Among Ashkenazi Jews, the incidence of the condition
is about 1 in 6000 births but, in other groups, the incidence is
about 1 in 500,000 births (Myrianthopoulos and Aronson 1966). What
incidence of the disease would be expected among the offspring of
matings of Ashkenazi Jews with members of other groups? If these
offspring were to mate randomly among themselves, what incidence
of the disease would be expected in future generations?


ANSWER The allele frequency of the Tay-Sachs mutation among
Ashkenazi Jews is estimated as $q_1 = \sqrt{1/6000} = 1.291 \times 10^{-2}$;
in other groups, $q_2 = \sqrt{1/500{,}000} = 1.414 \times 10^{-3}$. In matings
between members of the two groups, the expected frequency of homozygous
recessives is $q_1 q_2 = 1.826 \times 10^{-5}$, or about 1 in 55,000 births.
There is actually a greater reduction in the first generation than in
subsequent generations because each mating in the first generation combines a
high-risk gamete with a low-risk gamete. The allele frequency in the
first-generation offspring is $(q_1 + q_2)/2 = 7.162 \times 10^{-3}$ and, with
HWE in subsequent generations, the homozygous recessive frequency stabilizes at
$(7.162 \times 10^{-3})^2 = 5.130 \times 10^{-5}$, or about 1 in 19,000 births.
(The fact that homozygous recessives do not reproduce has been ignored because
the effect is negligible.)
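
The arithmetic in the answer to Problem 4.3 can be reproduced in a few lines. This is only a sketch under the same assumptions as the answer (HWE within each group, equal contributions of the two groups to the first generation, selection ignored); the variable names are chosen here for illustration.

```python
# Arithmetic for Problem 4.3 (Tay-Sachs incidence after admixture).
# Same assumptions as the printed answer; variable names are illustrative.
from math import sqrt

q1 = sqrt(1 / 6000)       # allele frequency among Ashkenazi Jews
q2 = sqrt(1 / 500_000)    # allele frequency in other groups

# First generation: each child receives one gamete from each group.
first_generation = q1 * q2

# Later generations: random mating at the averaged allele frequency.
q_bar = (q1 + q2) / 2
later_generations = q_bar ** 2

print(f"q1 = {q1:.4e}, q2 = {q2:.4e}")
print(f"first generation:  {first_generation:.3e} (1 in {1 / first_generation:,.0f})")
print(f"later generations: {later_generations:.3e} (1 in {1 / later_generations:,.0f})")
```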


Wahlund's  Principle  and  the  Fixation  Index 

Equation 4.4 applies equally well to AA homozygotes as to aa homozygotes.
Therefore, letting P represent the frequency of homozygous AA genotypes,
we can write

$$
P_{\text{separate}} - P_{\text{fused}} = \sigma_p^2 \tag{4.5}
$$

When there are only two alleles, the total reduction in homozygosity must
be the summation of Equations 4.4 and 4.5, which equals
$\sigma_p^2 + \sigma_q^2$. Because there are only two alleles, it is also true
that $\sigma_p^2 = \sigma_q^2$, which we will write as $\sigma^2$. Hence, the
total reduction in homozygosity from the Wahlund effect upon population fusion
and HWE can be expressed as follows:

Reduction in total homozygosity = $2\sigma^2$

On the other hand, the reduction in total homozygosity with population
fusion must also equal the increase in heterozygosity (the term $H_T - H_S$
in Equation 4.3), which is the numerator of $F_{ST}$. Hence,
$F_{ST} = (H_T - H_S)/H_T = 2\sigma^2/H_T$. However, $H_T$ is the
heterozygosity with HWE using the average allele frequencies, $\bar{p}$ and
$\bar{q}$, across subpopulations, so that $H_T = 2\bar{p}\bar{q}$. Therefore,
the connection between the fixation index $F_{ST}$ and the variance in allele
frequency is

$$
F_{ST} = \frac{\sigma^2}{\bar{p}\,\bar{q}} \tag{4.6}
$$

Consequently, the F statistics at the various levels of a hierarchical
population are related to the variances in allele frequencies among the
subpopulations grouped together at the various levels. Equation 4.6 affords a
convenient method of estimating $F_{ST}$ from allele-frequency data. For
example, among the subpopulations of Linanthus in Figure 4.2, the variance in
allele frequency is 0.0473. Earlier we calculated the average allele frequencies
as $\bar{p} = 0.8626$ and $\bar{q} = 0.1374$. Hence,
$\sigma^2/(\bar{p} \times \bar{q}) = 0.3993$, which confirms the
previous calculation that $F_{ST} = 0.3993$. (The values as stated may differ
slightly from yours because they were calculated with more than four
significant digits.)


PROBLEM 4.4 The data in the accompanying table are the allele
frequencies of several genes in three human subpopulations: (A) blacks
from West Africa; (B) blacks from Claxton, Georgia; and (C) whites
from Claxton, Georgia (Adams and Ward 1973). Each gene has two
predominant alleles and may, for purposes of this problem, be
considered to have only two alleles. The genes control the MN blood
group (alleles M and N), the Ss blood group (alleles S and s), the
Duffy blood group (alleles $Fy^a$ and $Fy^b$), the Kidd blood group (alleles
$Jk^a$ and $Jk^b$), the Kell blood group (alleles $Js^a$ and $Js^b$), the
enzyme glucose-6-phosphate dehydrogenase (alleles $G6PD^-$ and $G6PD^+$), and
β-hemoglobin (alleles $\beta^S$ and $\beta^+$). For each gene, use Equation 4.6
to estimate $F_{ST}$ for the comparison A versus B and for the comparison A
versus C. Classify each $F_{ST}$ as indicating little, moderate, great, or very
great genetic differentiation according to Wright's qualitative
guidelines. Note: In comparing two subpopulations with two alleles in
each, the variance in allele frequency is $\sigma^2 = (p_1 - p_2)^2/4$.


                      Subpopulation
Gene       A                      B                  C
(allele)   Blacks (West Africa)   Blacks (Georgia)   Whites (Georgia)

M          0.474                  0.484              0.507
S          0.172                  0.157              0.279
Fy^a       0                      0.045              0.422
Jk^a       0.693                  0.743              0.536
Js^a       0.117                  0.123              0.002
G6PD^-     0.176                  0.118              0
β^S        0.090                  0.043              0


ANSWER The estimates and their qualitative interpretations are as
shown in the table. It is clear that the degree of genetic divergence
between West African blacks and Georgia blacks, as assessed by the
average $F_{ST}$ value, is relatively small. However, some of the genes
show substantial genetic divergence between blacks and whites.
Note, however, that the fixation index can differ substantially from
one gene to another. This compilation of genes includes gradations of
genetic divergence ranging from little to very great.


Gene       A versus B         A versus C

M          0.0001 (little)    0.0011 (little)
S          0.0004 (little)    0.0164 (little)
Fy^a       0.0230 (little)    0.2676 (very great)
Jk^a       0.0031 (little)    0.0260 (little)
Js^a       0.0001 (little)    0.0591 (moderate)
G6PD^-     0.0067 (little)    0.0965 (moderate)
β^S        0.0089 (little)    0.0471 (little)
Average    0.0060 (little)    0.0734 (moderate)
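
The estimates in the answer to Problem 4.4 can be recomputed from the allele frequencies. The sketch below applies Equation 4.6 with the two-subpopulation variance given in the problem's note. The category thresholds (0.05, 0.15, 0.25) are the cutoffs usually quoted for Wright's qualitative guidelines and are an assumption here, as are the function names and the plain-text gene labels. Computed values may differ from the printed table in the last digit.

```python
# F_ST for each gene in Problem 4.4, using Equation 4.6 with
# sigma^2 = (p1 - p2)^2 / 4 for two subpopulations.
# Thresholds for Wright's categories are assumed, not taken from this section.

# Allele frequencies in subpopulations A, B, C (from the problem's table)
FREQS = {
    "M":      (0.474, 0.484, 0.507),
    "S":      (0.172, 0.157, 0.279),
    "Fy^a":   (0.0,   0.045, 0.422),
    "Jk^a":   (0.693, 0.743, 0.536),
    "Js^a":   (0.117, 0.123, 0.002),
    "G6PD^-": (0.176, 0.118, 0.0),
    "beta^S": (0.090, 0.043, 0.0),
}

def fst_pair(p1: float, p2: float) -> float:
    """F_ST = variance / (p_bar * q_bar) for two equally weighted subpopulations."""
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    variance = (p1 - p2) ** 2 / 4
    if p_bar * q_bar == 0:
        return 0.0  # allele fixed or absent everywhere; treated as no differentiation
    return variance / (p_bar * q_bar)

def wright_category(fst: float) -> str:
    """Qualitative label using the assumed cutoffs 0.05, 0.15, and 0.25."""
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "great"
    return "very great"

def compare(i: int, j: int) -> dict:
    """F_ST for every gene between the subpopulations in columns i and j."""
    return {gene: fst_pair(f[i], f[j]) for gene, f in FREQS.items()}

a_vs_b = compare(0, 1)
a_vs_c = compare(0, 2)
for gene in FREQS:
    print(f"{gene:8s} {a_vs_b[gene]:.4f} ({wright_category(a_vs_b[gene])})   "
          f"{a_vs_c[gene]:.4f} ({wright_category(a_vs_c[gene])})")

mean_ab = sum(a_vs_b.values()) / len(a_vs_b)
mean_ac = sum(a_vs_c.values()) / len(a_vs_c)
print(f"Average  {mean_ab:.4f} ({wright_category(mean_ab)})   "
      f"{mean_ac:.4f} ({wright_category(mean_ac)})")
```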

Genotype  Frequencies  in  Subdivided  Populations 

In many organisms in which the population structure is hierarchical, it is
useful to be able to calculate directly the average genotype frequencies across
all subpopulations. Equations 4.4 through 4.6 make it possible to deduce the
average genotype frequencies. Consider first Equation 4.5, which pertains to
the genotype frequency of AA. The quantity called $P_{\text{separate}}$ is what
we wish to calculate: it is the average frequency of AA across subpopulations.
The quantity $P_{\text{fused}}$ equals $\bar{p}^2$, the genotype frequency of AA
with population fusion and HWE. The value of $\sigma^2$ is also known from
Equation 4.6: it equals $F_{ST}\,\bar{p}\,\bar{q}$. Putting all this together,
the average genotype frequency of AA across subpopulations must equal
$\bar{p}^2 + F_{ST}\bar{p}\bar{q}$. Likewise, interpreting Equation 4.4 in the
same manner as Equation 4.5 yields the average genotype frequency of aa across
subpopulations as $\bar{q}^2 + F_{ST}\bar{p}\bar{q}$.

Because every genotype that is not homozygous must be heterozygous,
the average genotype frequency of heterozygotes across subpopulations is
given by $1 - (\bar{p}^2 + F_{ST}\bar{p}\bar{q}) - (\bar{q}^2 + F_{ST}\bar{p}\bar{q})$.
Note that $1 - \bar{p}^2 - \bar{q}^2 = 2\bar{p}\bar{q}$ and so the
average frequency of heterozygotes simplifies to
$2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST}$.

The  genotype  frequencies  in  a subdivided  population  are  important 
enough  to  be  displayed: 


$$
\begin{aligned}
AA &: \quad \bar{p}^2 + \bar{p}\bar{q}F_{ST} \\
Aa &: \quad 2\bar{p}\bar{q} - 2\bar{p}\bar{q}F_{ST} \\
aa &: \quad \bar{q}^2 + \bar{p}\bar{q}F_{ST}
\end{aligned}
\tag{4.7}
$$

These genotype frequencies are the average genotype frequencies across
all subpopulations. They do not obey the Hardy-Weinberg principle because
there is an excess of homozygotes and a deficiency of heterozygotes relative
to HWE. The result is somewhat paradoxical because, within any particular
subpopulation, the genotype frequencies do obey the Hardy-Weinberg
principle with whatever allele frequencies are found in that subpopulation. The
reason for the validity of HWE within each subpopulation is the assumption
of random mating within each subpopulation. The reason for the departure
from HWE in the population as a whole is that the subpopulations differ in
allele frequency. Because the allele frequencies differ, random mating within
each subpopulation is not equivalent to random mating among all the
organisms in the entire population.

From the expressions in Equation 4.7, it is clear that the value of $F_{ST}$
determines the degree of departure from HWE. If $F_{ST} = 0$, the second term
in each expression vanishes, and the genotype frequencies reduce to the HWE; on
the other hand, $F_{ST} = 0$ means that there is no variation in allele
frequency among the subpopulations for the gene in question. Because $F_{ST}$
may vary from one gene to the next, other genes in the same subpopulations may
have nonzero values of $F_{ST}$. The extreme case is $F_{ST} = 1$, which happens
when two subpopulations are fixed for alternative alleles. In this case, the
average allele frequencies are 1/2 for each allele and the average genotype
frequencies of AA, Aa, and aa across subpopulations are 1/2, 0, and 1/2,
respectively. This case is illustrated in Figure 4.1.
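
Equation 4.7 and its two limiting cases can be illustrated with a short sketch. The function name `subdivided_genotype_freqs` and the returned dictionary layout are choices made here for illustration only.

```python
# Average genotype frequencies in a subdivided population (Equation 4.7).
# p_bar is the average frequency of allele A; fst is the fixation index F_ST.

def subdivided_genotype_freqs(p_bar: float, fst: float) -> dict:
    """Return average frequencies of AA, Aa, and aa across subpopulations."""
    if not (0.0 <= p_bar <= 1.0 and 0.0 <= fst <= 1.0):
        raise ValueError("p_bar and fst must both lie between 0 and 1")
    q_bar = 1.0 - p_bar
    excess = p_bar * q_bar * fst  # homozygote excess contributed by substructure
    return {
        "AA": p_bar ** 2 + excess,
        "Aa": 2 * p_bar * q_bar - 2 * excess,
        "aa": q_bar ** 2 + excess,
    }

# F_ST = 0 recovers the Hardy-Weinberg frequencies.
print(subdivided_genotype_freqs(0.5, 0.0))
# F_ST = 1 with average frequencies of 1/2: the case of Figure 4.1.
print(subdivided_genotype_freqs(0.5, 1.0))
```

For any valid inputs, the three frequencies sum to 1 because the homozygote excess is exactly offset by the heterozygote deficit.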

POPULATION  GENETICS  IN  DNA  TYPING 

The term DNA typing means the application of molecular genetics to
highly polymorphic genetic markers for the purpose of matching DNA samples
from unknown people with those of known suspects. Applications include
paternity testing, in which DNA from a child is matched against that of an
accused father, and criminal investigation, in which a crime-scene sample of
DNA from blood, semen, or other sources is matched against that of one or
more suspects. DNA typing undoubtedly ranks with the use of fingerprints
as a major innovation in personal identification.

In theory, DNA typing is not as powerful as ordinary fingerprinting.
Fingerprints result from the pattern of raised skin ridges that carry sweat
glands. The ridge pattern on each finger may form an arch, loop, whorl, or
other design. The ridges vary in pattern from one person to the next so greatly
that each person has unique fingerprints suitable for personal identification.
When the fingers are formed in the embryo, the fingertips develop as
fluid-filled pads. The fluid is later resorbed and the expanded skin collapses,
forming the ridges. There is a strong random component to the manner in which
the skin collapses, and so the details of the fingerprint pattern differ in each
finger and in each person. Even identical twins have different fingerprints.
However, certain general features of the fingerprints are strongly inherited,
for example, the total number of ridges on all the fingers, without regard to
pattern. The Dionne quintuplets (five Canadian girls born in 1934, all
formed from the splitting of a single fertilized egg) had total ridge counts
ranging between 99 and 102; by comparison, their older siblings had total
ridge counts of 69, 78, and 139.

It is the random component in fingerprint ridge pattern that makes
fingerprints so powerful for personal identification. DNA types are inherited
and so are not necessarily unique in each person. Even for a highly
polymorphic marker in which both parents are heterozygous (for example, in the
mating $A_iA_j \times A_kA_l$), any particular genotype in an offspring has a
1/4 chance of being matched in a sibling owing to Mendelian segregation. Thus,
strong evidence that an unknown DNA sample comes from a particular suspect can
come only from the matching of a combination of genotypes across a number
of polymorphic loci. The strength of the evidence increases with the number
of loci that are examined and number of alleles present in the population. The
greater the number of loci, and the more highly polymorphic the loci, the
stronger the evidence linking the suspect to the unknown sample. Although
matching DNA types may provide strong evidence that a suspect is the
source of an unknown sample, a DNA mismatch is usually conclusive. When
the DNA of a suspect contains alleles that are clearly not present in the
unknown sample, then the sample must have originated from a different
person.
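
The 1/4 chance of a sibling match at a single locus follows from enumerating the equally likely offspring genotypes. The sketch below does this by brute force; the function name and the allele labels are hypothetical, and the only assumption is Mendelian segregation with each parental allele transmitted with probability 1/2.

```python
# Probability that two full siblings share the same genotype at one locus,
# by enumeration of the equally likely offspring genotypes.
from fractions import Fraction
from itertools import product

def sibling_match_probability(mother, father):
    """mother and father are 2-tuples of allele labels at a single locus."""
    # One allele from each parent; sort so that (x, y) and (y, x) are the same genotype.
    offspring = [tuple(sorted(pair)) for pair in product(mother, father)]
    n = len(offspring)  # four equally likely outcomes
    matches = sum(1 for first, second in product(offspring, repeat=2) if first == second)
    return Fraction(matches, n * n)

# Both parents heterozygous for four different alleles.
print(sibling_match_probability(("Ai", "Aj"), ("Ak", "Al")))  # 1/4
```

When the parents share alleles, the offspring genotypes are no longer all distinct and the sibling match probability rises above 1/4, which is one more reason to type several loci.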

Polymorphisms Based on a Variable Number of Tandem Repeats (VNTR)

The type of polymorphism usually used in DNA typing in the United States
is illustrated in Figure 4.4. Each allele of a locus is defined by the size of a
restriction fragment that hybridizes with a locus-specific probe in a Southern
blot (Chapter 2). The restriction fragments differ in size according to the
number of copies they contain of a short sequence of nucleotides repeated in
tandem. When there are more copies of the repeating unit, the restriction
fragment is of greater size. A polymorphic gene of this type is called a VNTR
polymorphism, which means that the restriction fragments contain a
variable number of tandem repeats. VNTRs are employed in DNA typing
because the variable number of repeating units makes many alleles possible.
Although many alleles may be present in the population as a whole, any one
person can have no more than two alleles of each VNTR locus.

Figure 4.4 Allelic variation resulting from a variable number of units repeated
in tandem in a nonessential region of a gene. The probe DNA detects a
restriction fragment for each allele. The length of the fragment depends on the
number of repeating units present. (From Hartl 1994.)

An example of a VNTR used in DNA typing is shown in Figure 4.5.
The lanes in the gel labeled M contain multiple DNA fragments of known
size to serve as molecular-weight markers. Each numbered lane contains
DNA from a different person. Two typical features of VNTRs are to be noted:

• Most people are heterozygous for two VNTR alleles with restriction
fragments of different size. Heterozygosity is indicated by the presence of
two distinct bands. In Figure 4.5, only the persons numbered 2 and 5
appear to be homozygous for a particular allele.

• The restriction fragments from different people cover a wide range of
sizes. The variability in size indicates that the population as a whole
contains many VNTR alleles.

Figure 4.5 also makes it clear why VNTR polymorphisms are useful in
DNA typing: each of the 13 people has a different DNA type (pattern of
bands) for this VNTR and therefore could be distinguished from any other
person. On the other hand, the uniqueness of each DNA type in Figure 4.5
results in part from the small sample size. If more people were examined,
then DNA types that matched by chance might well be found among
unrelated people.

Figure 4.5 Genetic variation in a VNTR used in DNA typing. Each numbered
lane contains DNA from a single person. After digestion of the DNA with a
restriction enzyme, the fragments are separated by electrophoresis and
hybridized with a radioactive probe DNA. The lanes labeled M contain
molecular-weight markers; lane C is another type of internal control.
(Courtesy of R. W. Allen.)

For example, in one study of five VNTR loci, the chance of a
match between unrelated people ranged from 1/20 to 1/200, depending on
the locus (Herrin 1993). Although less common, chance matches for two
VNTR loci can also be found among unrelated people. The same study found
two-locus matches at frequencies of 1/2,500 to 1/50,000. Even chance
matches for three VNTR loci are far from impossible. In one study of Italians
from Milan, three-locus matches were found at a frequency of approximately
1/1,200 (Krane et al. 1992). Because of the possibility of chance matches
between VNTR types, applications of DNA typing are usually based on at
least three loci and preferably more. Matches at 7 to 9 VNTR loci are
virtually definitive of identity, barring technical errors in the DNA typing
itself (such as mislabeling of blood samples) and except for identical twins.

DNA typing can be exclusionary as well as incriminating. For example, if
the DNA type of a suspected rapist does not match the DNA type of semen
taken from the victim, then the suspect could not be the perpetrator, unless
there is some reason to suspect that the test itself was faulty. For example,
Figure 4.6 shows the DNA profiles of nine VNTR loci among three suspects
and from evidence recovered in seven serial rape cases. The label M denotes
molecular-weight markers (present in four lanes in each panel), S1-S3
denotes three suspects in the cases, and U1-U7 denotes DNA from semen
samples recovered from the seven victims. Suspects S1 and S3 are excluded
by the DNA typing, but S2 matches at all nine loci. Based on this and other
evidence, a jury convicted suspect S2 of 81 criminal counts related to these
and other cases. He was sentenced to 139 years in prison and will not become
eligible for parole until the year 2087.

Match  Probabilities  with  Hardy-Weinberg  Equilibrium 
and  Linkage  Equilibrium 

If a person is found whose DNA type matches that of a sample found at the
scene of a crime, how is the significance of the match to be evaluated? The
significance of the match depends on the likelihood of it happening by
chance, and hence matches of rare DNA types are more telling than
matches of common DNA types. Initially, the method for estimating the frequency
of a DNA type in the population was to use a cross-multiplication square like
that in Figure 3.6, extended to multiple alleles, to calculate the expected
frequency of the particular genotype for each VNTR locus; this calculation
assumes Hardy-Weinberg equilibrium (HWE). The locus-by-locus frequencies
were then multiplied together to obtain the expected frequency of the
multilocus match; this calculation assumes linkage equilibrium. With HWE and
linkage equilibrium, the expected frequency of a DNA type in the population
as a whole is calculated as


$$
\prod_{\text{homozygous loci}} p_i^2 \;\times \prod_{\text{heterozygous loci}} 2p_ip_j \tag{4.8}
$$

where capital $\Pi$ means chain multiplication. The first multiplication is
across all loci presumed to be homozygous owing to the presence of a single band
in the gel; for each locus, $p_i$ is the frequency of the allele that is
homozygous. The second multiplication is across all heterozygous loci and, for
each locus, the factor is two times the product of the frequencies, $p_i$ and
$p_j$, of the alleles that are heterozygous. Because human subpopulations can
differ in their allele frequencies, the calculation would be carried out using
allele frequencies among Caucasians for white suspects, using those among
blacks for black suspects, and using those among Hispanics for Hispanic
suspects.
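
Equation 4.8 amounts to multiplying one Hardy-Weinberg genotype frequency per locus. The sketch below shows the calculation for a three-locus profile. The locus names, allele labels, and allele frequencies are entirely hypothetical and are not taken from any real database; the function name `match_probability` is likewise an illustrative choice.

```python
# Expected frequency of a multilocus DNA type under HWE and linkage
# equilibrium (Equation 4.8). All loci, alleles, and frequencies below
# are hypothetical values invented for this illustration.
from math import prod

def match_probability(profile, allele_freqs):
    """profile maps locus -> (allele1, allele2); allele_freqs maps locus -> {allele: freq}."""
    factors = []
    for locus, (a1, a2) in profile.items():
        p_i = allele_freqs[locus][a1]
        p_j = allele_freqs[locus][a2]
        if a1 == a2:
            factors.append(p_i ** 2)        # homozygous locus: p_i^2
        else:
            factors.append(2 * p_i * p_j)   # heterozygous locus: 2 p_i p_j
    return prod(factors)

# Hypothetical database frequencies for three loci
ALLELE_FREQS = {
    "locus_1": {"a": 0.10, "b": 0.05},
    "locus_2": {"c": 0.20},
    "locus_3": {"d": 0.08, "e": 0.12},
}

# Hypothetical profile: heterozygous, homozygous (single band), heterozygous
PROFILE = {
    "locus_1": ("a", "b"),
    "locus_2": ("c", "c"),
    "locus_3": ("d", "e"),
}

freq = match_probability(PROFILE, ALLELE_FREQS)
print(f"expected frequency = {freq:.3e} (about 1 in {1 / freq:,.0f})")
```

In practice the dictionary of allele frequencies would be chosen to match the reference population considered appropriate, which is the point at issue in the discussion of population substructure.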

Effects  of  Population  Substructure 

Figure 4.6 An example of DNA typing. Suspect S2 matches evidence samples in
seven rape cases (U1-U7) for each of nine VNTR loci (D1S7, D2S44, D4S139, and
so forth). Suspects S1 and S3 do not match and are excluded. The lanes labeled
M contain molecular-weight markers. (Courtesy of Steven L. Redding, Office of
the Hennepin County District Attorney, Minneapolis, and Lowell C. Van Berkom
and Carla J. Finis, Minnesota Bureau of Criminal Apprehension.)

The multiplication in Equation 4.8 makes a number of assumptions about
human populations: (1) that the Hardy-Weinberg principle holds for each
locus, (2) that each locus is statistically independent of the others so that the
multiplication across loci is justified, and (3) that the only level of
population substructure that is important for DNA typing is that of race.
Critics of the multiplication rule argued that genetically important
subpopulations need not coincide with racial designations. For example, the term
"Hispanic" includes a mixture of different subpopulations with variable
amounts of Spanish, native American Indian, and African ancestry.
Similarly, there are potentially important differences in allele frequency
among Caucasian populations (for example, Finnish people versus Italians)
and among black populations (for example, blacks from Africa versus
blacks from Trinidad). Furthermore, if the allele frequencies of different
VNTRs differ among subpopulations, then the loci are not statistically
independent, even if they are genetically unlinked, and so the multiplication
across loci is unjustified. Because of population substructure, DNA matches
across multiple VNTRs could be more common among people within a
particular ethnic group than among people drawn at random from the
population as a whole, and so calculations of genotype frequency should be based
on the ethnic group of the accused person and not on the race as a whole.
On the other side, defenders of the multiplication rule argued that
population substructure would have a relatively minor effect on the final outcome
of the calculation and that what matters most is not a high degree of
accuracy but rather a general sense of whether a particular multilocus genotype
is rare or common. After much acrimony in the scientific community and in
courts of law, a panel of the National Research Council (NRC 1992)
recommended a compromise called the ceiling principle in which a modified
multiplication procedure was adopted using, for each allele frequency, a
"ceiling" equal to the larger of either 0.10 or the upper 95% confidence limit
of the highest frequency of the allele observed among at least three racial
databases.

Even this recommendation proved controversial because some population
geneticists regarded the compromise formula as too conservative.
Continuing controversy prompted the formation of a second panel of the National
Research Council (NRC 1996), which recommended the use of a modified
product rule that takes moderate population substructure into account.
According to this recommendation, in most cases the match probability may
be calculated according to the left-hand side of the following:

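$$\prod_{\substack{\text{homozygous}\\ \text{loci}}} 2p_i \prod_{\substack{\text{heterozygous}\\ \text{loci}}} 2p_ip_j \;>\; \prod_{\substack{\text{homozygous}\\ \text{loci}}} \left[p_i^2 + p_i(1 - p_i)F_{ST}\right] \prod_{\substack{\text{heterozygous}\\ \text{loci}}} \left[2p_ip_j - 2p_ip_jF_{ST}\right]$$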

In this expression, $p_i$ and $p_j$ have the same meaning as in Equation 4.8, and
$F_{ST}$ is the fixation index among the subpopulations in the larger whole
(typically a major racial group). The use of the calculation is justified by the
inequality. Each factor on the right-hand side of this inequality is the
per-locus genotype frequency calculated from Equation 4.7, which takes $F_{ST}$ into
account. The left-hand side is greater than the right-hand side because, for
each homozygous locus, it can be shown that $2p_i > p_i^2 + p_i(1 - p_i)F_{ST}$; and, for
each heterozygous locus, it is clear that $2p_ip_j > 2p_ip_j - 2p_ip_jF_{ST}$ because $F_{ST} > 0$.
As important as the calculation itself, the committee emphasized, was
the principle that no probability value should be cited unless accompanied
by an appropriate 95% confidence interval to indicate its degree of reliability.
The 1996 report also enumerated a number of special situations in which
alternative formulas are required because of population substructure or
inbreeding.
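
The two sides of the inequality can be compared numerically. The following is a minimal sketch, not the committee's procedure: the three-locus profile, its allele frequencies, and the value of F_ST = 0.01 are all hypothetical numbers chosen for illustration, and the function names are our own.

```python
from math import prod

def locus_upper_bound(p_i, p_j=None):
    """Left-hand-side factor: 2*p_i if homozygous, 2*p_i*p_j if heterozygous."""
    return 2 * p_i if p_j is None else 2 * p_i * p_j

def locus_with_fst(p_i, p_j=None, fst=0.0):
    """Right-hand-side factor: per-locus genotype frequency allowing for F_ST."""
    if p_j is None:
        return p_i ** 2 + p_i * (1 - p_i) * fst
    return 2 * p_i * p_j - 2 * p_i * p_j * fst

# Hypothetical profile: (p_i, None) marks a homozygous locus,
# (p_i, p_j) a heterozygous locus. All frequencies are invented.
profile = [(0.05, None), (0.10, 0.20), (0.08, 0.03)]
fst = 0.01  # assumed value, for illustration only

lhs = prod(locus_upper_bound(p_i, p_j) for p_i, p_j in profile)
rhs = prod(locus_with_fst(p_i, p_j, fst) for p_i, p_j in profile)

print(f"left-hand side  (reported match probability): {lhs:.3e}")
print(f"right-hand side (frequency allowing for F_ST): {rhs:.3e}")
```

With these invented numbers the left-hand side exceeds the right-hand side, as the inequality requires, so quoting the left-hand side errs in favor of the accused.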


INBREEDING 

When matings take place between relatives, the pattern of mating is called
inbreeding. In human beings, the closest degree of inbreeding usually
encountered in most societies is first-cousin mating. Many plants regularly
undergo self-fertilization, and some insects regularly practice brother-sister
mating. Inbreeding need not unite close relatives, however. As we shall see,
a certain level of inbreeding is inescapable in small subpopulations because
the members of a subpopulation typically share recent or remote common
ancestors. The common ancestry between mating pairs constitutes
inbreeding. Hence, the genetic differentiation among subpopulations described by
the hierarchical F statistics can be interpreted as a sort of inbreeding effect
resulting from population substructure. The relationship between
population substructure and inbreeding is a subtle one, but it has profound
consequences in population genetics.

Genotype  Frequencies  with  Inbreeding 

The main effect of population substructure is a decrease in average
heterozygosity among subpopulations, relative to the heterozygosity expected
with random mating in a hypothetical total population. Likewise, the main
effect of inbreeding is to produce organisms with a decrease in
heterozygosity, relative to the heterozygosity expected with random mating in the same
subpopulation. The decrease in heterozygosity due to inbreeding can be
illustrated with the example of repeated self-fertilization. Consider a
self-fertilizing population of plants that consists of 1/4 AA, 1/2 Aa, and 1/4 aa
genotypes, which are in Hardy-Weinberg proportions. Because each plant
undergoes self-fertilization, the AA and aa genotypes produce only AA and aa
offspring, respectively, and the Aa genotypes produce 1/4 AA, 1/2 Aa, and 1/4 aa
offspring. After one generation of self-fertilization, therefore, the genotype
frequencies of AA, Aa, and aa are:

AA: 1/4 × 1 + 1/2 × 1/4 = 3/8

Aa: 1/2 × 1/2 = 2/8

aa: 1/4 × 1 + 1/2 × 1/4 = 3/8

These genotype frequencies are no longer in Hardy-Weinberg
proportions. There is a deficiency of heterozygous genotypes and an excess of
homozygous genotypes. After a second generation of self-fertilization, the
genotype frequencies are 7/16 AA, 2/16 Aa, and 7/16 aa, which have an even
greater deficiency of heterozygotes. Note, however, that the allele frequency
of A remains constant. Denoting the allele frequency of A as p, then:

In the initial population: p = 1/4 + 1/2 × 1/2 = 1/2

After one generation of selfing: p = 3/8 + 1/2 × 2/8 = 1/2

After two generations of selfing: p = 7/16 + 1/2 × 2/16 = 1/2

The example of self-fertilization illustrates the general principle that
inbreeding, by itself, does not change the allele frequency. One assumption
required for constant allele frequencies under inbreeding is that all
genotypes must have an equal likelihood of survival and reproduction, which is to
say that no natural selection takes place. If there is selection, then the allele
frequencies can change with inbreeding (or, for that matter, with any mating
system).
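
The arithmetic of repeated self-fertilization is easy to reproduce. The sketch below uses exact fractions and assumes, as the text does, no selection; the function names are our own and are introduced only for illustration.

```python
from fractions import Fraction

def self_fertilize(freqs):
    """One generation of complete selfing.

    freqs is a tuple (AA, Aa, aa). Homozygotes breed true; each Aa plant
    yields 1/4 AA, 1/2 Aa, and 1/4 aa offspring.
    """
    hom_A, het, hom_a = freqs
    quarter, half = Fraction(1, 4), Fraction(1, 2)
    return (hom_A + quarter * het, half * het, hom_a + quarter * het)

def allele_freq_A(freqs):
    """Frequency of allele A: all of AA plus half of Aa."""
    return freqs[0] + Fraction(1, 2) * freqs[1]

gen0 = (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4))  # Hardy-Weinberg start
gen1 = self_fertilize(gen0)
gen2 = self_fertilize(gen1)

for label, gen in (("initial", gen0), ("one generation", gen1), ("two generations", gen2)):
    print(label, [str(x) for x in gen], "p =", allele_freq_A(gen))
```

The heterozygote frequency is halved in each generation while the allele frequency of A stays at 1/2.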

The effects of inbreeding can be made quantitative by comparing the
proportion of heterozygous genotypes among inbred organisms with the
proportion of heterozygous genotypes expected with random mating. To be
precise, consider a gene with two alleles, A and a, at respective frequencies p
and q (with p + q = 1). Suppose that the frequency of heterozygous genotypes
in a subpopulation of inbred organisms is some quantity $H_I$. Were the
subpopulation undergoing random mating, the HWE frequency of heterozygous
genotypes would be 2pq. However, for the sake of generality, we will denote
the random-mating heterozygosity by the symbol $H_0$. The effects of
inbreeding can be defined as the proportionate reduction in heterozygosity relative
to random mating. This value is expressed mathematically as $(H_0 - H_I)/H_0$;
this ratio is usually denoted by the symbol F, which is called the inbreeding
coefficient. At this point, the use of F for the inbreeding coefficient may seem
a poor choice in view of the use of $F_{ST}$ and related symbols for measuring the
effects of population substructure, but we will see in a few moments that
inbreeding and population substructure are intimately related.

Thus we define

$$F = \frac{H_0 - H_I}{H_0} \tag{4.9}$$

In biological terms, F measures the fractional reduction in heterozygosity
of an inbred subpopulation relative to a random-mating subpopulation with
the same allele frequencies. Because $H_0 = 2pq$, the frequency of heterozygous
genotypes in the inbred subpopulation can be written in terms of F as
$H_I = H_0 - H_0F = H_0(1 - F) = 2pq(1 - F)$.

The frequency of AA homozygous genotypes in an inbred
subpopulation can also be expressed in terms of F. Suppose that the proportion of AA
genotypes is denoted P. Because the allele frequency of A is p, we
must have, by Equation 4.9, that $P + H_I/2 = p$. But $H_I = 2pq(1 - F)$, and so
$P = p - 2pq(1 - F)/2$.


PROBLEM 4.5 Use the relation $P = p - 2pq(1 - F)/2$ and the fact that
$p + q = 1$ to show that $P = p^2 + pqF$. Show also that P can be written as
$P = p^2(1 - F) + pF$.


ANSWER $P = p - 2pq(1 - F)/2 = p - pq(1 - F) = p - pq + pqF = p(1 - q)
+ pqF = p^2 + pqF$. This establishes the first identity. Then, substituting
for q in the second term, $P = p^2 + p(1 - p)F = p^2 + pF - p^2F = p^2(1 - F) +
pF$.


Problem 4.5 shows that the frequency of AA genotypes in an inbred
subpopulation equals $p^2(1 - F) + pF$. In a similar manner, it can be shown that the
frequency of aa genotypes is $q^2(1 - F) + qF$.

In summary, in a subpopulation of organisms with inbreeding coefficient
F, the genotype frequencies are expected in the proportions:

$$\begin{aligned}
AA &: \; p^2(1 - F) + pF = p^2 + pqF \\
Aa &: \; 2pq(1 - F) = 2pq - 2pqF \\
aa &: \; q^2(1 - F) + qF = q^2 + pqF
\end{aligned} \tag{4.10}$$

The expressions at the far right in Equation 4.10 facilitate comparison of
the genotype frequencies expected with inbreeding relative to those expected
with HWE. With inbreeding, there is a deficiency of heterozygotes equal to
2pqF and an excess of each homozygous class equal to half the deficiency of
heterozygotes. The biological reason that the missing heterozygotes are
allocated equally to the two homozygous classes is that each heterozygous
genotype contains one A and one a allele. Notice that when there is no inbreeding
(F = 0), the genotype frequencies are in the familiar Hardy-Weinberg
proportions; with complete inbreeding (F = 1), the inbred subpopulation consists
entirely of AA and aa homozygotes in the frequencies p and q, respectively.
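
Equation 4.10 translates directly into a few lines of code. The following is a small sketch; the function name and the example values p = 0.6 and F = 0.25 are our own choices for illustration and do not come from any data set.

```python
def inbred_genotype_freqs(p, F):
    """Genotype frequencies of Equation 4.10 for allele frequency p and
    inbreeding coefficient F (two alleles, q = 1 - p)."""
    q = 1.0 - p
    return {
        "AA": p * p * (1 - F) + p * F,
        "Aa": 2 * p * q * (1 - F),
        "aa": q * q * (1 - F) + q * F,
    }

p, F = 0.6, 0.25  # illustrative values only
inbred = inbred_genotype_freqs(p, F)
hwe = inbred_genotype_freqs(p, 0.0)   # F = 0 recovers Hardy-Weinberg
selfed = inbred_genotype_freqs(p, 1.0)  # F = 1 leaves only homozygotes

# Deficiency of heterozygotes and excess of each homozygous class
het_deficit = hwe["Aa"] - inbred["Aa"]    # equals 2pqF
hom_excess = inbred["AA"] - hwe["AA"]     # equals pqF, half the deficit

print(inbred, het_deficit, hom_excess)
```

Setting F = 0 and F = 1 in the same function reproduces the two limiting cases described in the text.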

If a gene has multiple alleles $A_1, A_2, \ldots, A_n$ at respective frequencies $p_1,
p_2, \ldots, p_n$ (with $p_1 + p_2 + \cdots + p_n = 1$), then in a population with inbreeding
coefficient F, the frequencies of $A_iA_i$ homozygotes and $A_iA_j$ heterozygotes are
as follows:

$$\begin{aligned}
A_iA_i &: \; p_i^2(1 - F) + p_iF \\
A_iA_j &: \; 2p_ip_j(1 - F)
\end{aligned} \tag{4.11}$$

We are now in a position to apply Equations 4.10 and 4.11 to real data.


PROBLEM 4.6 Plants able to undergo self-fertilization are said to
be self-compatible. In a population of self-compatible plants, if each
plant undergoes self-fertilization a fraction s of the time and
otherwise mates randomly, then it can be shown (Crow and Kimura
1970; Hedrick and Cockerham 1986) that F very quickly attains the
value F = s/(2 - s). Phlox cuspidata is self-compatible, and for this
species the amount of self-fertilization is estimated at
approximately s = 0.78 (Levin 1978). From s we can predict the inbreeding
coefficient as F = 0.78/(2 - 0.78) = 0.64. In a Texas population of P.
cuspidata, Levin (1978) found two electrophoretic alleles of the
phosphoglucomutase-2 gene, designated Pgm-2a and Pgm-2b. In a
sample of 35 plants, there were 15 Pgm-2a/Pgm-2a, 6 Pgm-2a/Pgm-2b,
and 14 Pgm-2b/Pgm-2b genotypes. Are these numbers consistent
with the estimate F = 0.64? (Note: The $\chi^2$ in this case has one degree
of freedom because only the allele frequency is estimated from the
data; if F also were estimated from the data, rather than being
calculated independently from the degree of self-fertilization, then
there would be zero degrees of freedom and no goodness-of-fit test
would be possible.)


ANSWER The allele frequencies of Pgm-2a and Pgm-2b are estimated
as (30 + 6)/70 = 0.514 and 1 - 0.514 = 0.486, respectively. The
hypothesis is that F = 0.64, and so 1 - F = 0.36. The expected numbers of the
genotypes aa, ab, and bb are, respectively, [(0.514)^2(0.36) + (0.514)
(0.64)](35) = 14.8, [2(0.514)(0.486)(0.36)](35) = 6.3, and [(0.486)^2(0.36) +
(0.486)(0.64)](35) = 13.9. With these expectations, the $\chi^2$ = 0.02 with
one degree of freedom, and the associated probability is about 0.9.
The fit to the inbreeding model is excellent.
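
The calculation in Problem 4.6 can be checked with a short script. This is a sketch under stated assumptions: the allele frequency is rounded to three decimals and F to two, mirroring the rounding used in the answer, and the P value for one degree of freedom is obtained from the complementary error function rather than from a statistical table. The variable names are our own.

```python
from math import erfc, sqrt

s = 0.78                      # estimated selfing rate (Levin 1978)
F = round(s / (2 - s), 2)     # predicted inbreeding coefficient, rounded as in the text

observed = {"aa": 15, "ab": 6, "bb": 14}
n = sum(observed.values())

# Allele frequency of Pgm-2a, rounded to three decimals as in the answer
p = round((2 * observed["aa"] + observed["ab"]) / (2 * n), 3)
q = 1 - p

# Expected numbers from Equation 4.10
expected = {
    "aa": (p * p * (1 - F) + p * F) * n,
    "ab": (2 * p * q * (1 - F)) * n,
    "bb": (q * q * (1 - F) + q * F) * n,
}

chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)

# Upper-tail probability of chi-square with ONE degree of freedom:
# P(X > x) = erfc(sqrt(x / 2)). This closed form holds only for 1 df.
p_value = erfc(sqrt(chi2 / 2))

print(F, p, {g: round(e, 1) for g, e in expected.items()})
print(round(chi2, 2), round(p_value, 2))
```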


PROBLEM 4.7 Assuming that F = 0.64 in Texas populations of Phlox
cuspidata, calculate the genotype frequencies expected from the four
alleles of the gene Adh coding for alcohol dehydrogenase by using the
allele frequencies 0.11 (Adh-1), 0.84 (Adh-2), 0.01 (Adh-3), and 0.04
(Adh-4) from Problem 3.10 in Chapter 3.


ANSWER Using the expressions in Equation 4.11, the expected
genotype frequencies are: Adh-1/Adh-1 = 0.0748, Adh-1/Adh-2 =
0.0665, Adh-2/Adh-2 = 0.7916, Adh-1/Adh-3 = 0.0008, Adh-2/Adh-3 =
0.0060, Adh-3/Adh-3 = 0.0064, Adh-1/Adh-4 = 0.0032, Adh-2/Adh-4 =
0.0242, Adh-3/Adh-4 = 0.0003, Adh-4/Adh-4 = 0.0262.
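
The ten genotype frequencies in this answer follow mechanically from Equation 4.11, which makes them convenient to verify in code. The sketch below uses the allele frequencies and F given in the problem; the function name and the dictionary layout are our own.

```python
from itertools import combinations_with_replacement

def multiallele_genotype_freqs(allele_freqs, F):
    """Apply Equation 4.11 to every unordered pair of alleles.

    allele_freqs maps allele name -> frequency; the result maps
    (allele_i, allele_j) -> expected genotype frequency.
    """
    freqs = {}
    for a, b in combinations_with_replacement(list(allele_freqs), 2):
        pa, pb = allele_freqs[a], allele_freqs[b]
        if a == b:
            freqs[(a, b)] = pa * pa * (1 - F) + pa * F   # homozygote
        else:
            freqs[(a, b)] = 2 * pa * pb * (1 - F)        # heterozygote
    return freqs

adh = {"Adh-1": 0.11, "Adh-2": 0.84, "Adh-3": 0.01, "Adh-4": 0.04}
genotypes = multiallele_genotype_freqs(adh, F=0.64)

for (a, b), f in genotypes.items():
    print(f"{a}/{b} = {f:.4f}")
```

The frequencies sum to 1, which is a useful check that no genotype class has been omitted.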


Relation  Between  the  Inbreeding  Coefficient 
and  the  F Statistics 

There is an intimate relation between the inbreeding coefficient F and the
hierarchical F statistics examined in the first section of this chapter. Each of
the hierarchical F statistics is also a type of inbreeding coefficient that
measures the reduction in heterozygosity at any level of a population hierarchy,
relative to a higher level. The connection between the inbreeding coefficient
and the F statistics is indicated by the formal similarity between Equation 4.7
and the right-hand side of Equation 4.10. To incorporate the inbreeding
coefficient F from mating between relatives into the hierarchical framework, we
will embellish it with the subscript IS. In words, $F_{IS}$ is the inbreeding
coefficient of a group of inbred organisms relative to the subpopulation to which
they belong. The value of $F_{IS}$ is the reduction in heterozygosity of the inbred
organisms, and the genotype frequencies among the inbred organisms are
given by Equation 4.10 with p and q equal to the allele frequencies in the
relevant subpopulation. Within each subpopulation there is random mating,
and so the genotype frequencies are given by the HWE. Among the
subpopulations, however, there is a reduction in average heterozygosity, relative to
the total population, because mates within subpopulations often share
remote common ancestors. The sharing of remote common ancestors
explains the apparent paradox that inbreeding accumulates even when there
is random mating within a subpopulation. The reduction in heterozygosity
attributable to this type of inbreeding, relative to the total population, is
measured by $F_{ST}$, and the appropriate formulas for the genotype frequencies,
averaged across the subpopulations, are given in Equation 4.7, in which p
and q are the average allele frequencies among the subpopulations.

A population geneticist is often interested not only in $F_{IS}$ but also in $F_{IT}$.
The former is the reduction in heterozygosity of a group of inbred organisms
relative to the subpopulation to which they belong; the latter is the reduction
in heterozygosity of the inbred organisms relative to the total population.
Hence, $F_{IT}$ is the most inclusive
measure of all inbreeding. It embraces not only the effects of mating between
close relatives within a subpopulation but also the accumulated inbreeding
resulting from mating between remote relatives at all levels of the population
hierarchy. An expression for $F_{IT}$ is implicit in the definitions. For consistency,
we will use the symbol $H_S$ to denote the heterozygosity in a particular
subpopulation. Hence, Equation 4.9 defining $F_{IS}$ may be rewritten as:

$$F_{IS} = \frac{H_S - H_I}{H_S} \tag{4.12}$$

Similarly, if we use $H_T$ to denote the heterozygosity in the total
population, the analogous equation defining $F_{IT}$ is:

$$F_{IT} = \frac{H_T - H_I}{H_T} \tag{4.13}$$

Consequently, $1 - F_{IS} = H_I/H_S$ and $1 - F_{IT} = H_I/H_T$. However, the remarks
in Problem 4.1 also indicate that $1 - F_{ST} = H_S/H_T$, and so by multiplication,

$$(1 - F_{IS})(1 - F_{ST}) = 1 - F_{IT} \tag{4.14}$$


Hence, if we know both $F_{IS}$ and $F_{ST}$, then we can obtain $F_{IT}$ from Equation
4.14. The value of $F_{ST}$ that results from mating between remote relatives in a
subpopulation of limited size is taken up in Chapter 7. The value of $F_{IS}$
resulting from mating between close relatives within a subpopulation can be
calculated from the pedigree of the inbred organisms by using an alternative
probability interpretation of $F_{IS}$ defined in the next section.
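
The multiplicative relation among the three coefficients can be confirmed numerically. The following sketch computes each coefficient from its defining ratio of heterozygosities; the three heterozygosity values are hypothetical numbers chosen for illustration, and the function name is our own.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (F_IS, F_ST, F_IT) from the heterozygosities of inbred
    organisms (h_i), a subpopulation (h_s), and the total population (h_t)."""
    f_is = (h_s - h_i) / h_s   # Equation 4.12
    f_st = (h_t - h_s) / h_t   # from 1 - F_ST = H_S / H_T
    f_it = (h_t - h_i) / h_t   # Equation 4.13
    return f_is, f_st, f_it

# Hypothetical heterozygosities, for illustration only
f_is, f_st, f_it = f_statistics(h_i=0.36, h_s=0.45, h_t=0.50)

# Equation 4.14: the probabilities of escaping inbreeding multiply
f_it_from_product = 1 - (1 - f_is) * (1 - f_st)

print(f_is, f_st, f_it, f_it_from_product)
```

In this example F_IS = 0.20 and F_ST = 0.10 combine to give F_IT = 0.28, not 0.30, because the coefficients combine through the product of their complements rather than by addition.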

The  Inbreeding  Coefficient  as  a Probability 

The inbreeding coefficient $F_{IS}$ (which we will again call simply F unless the
subscripts are needed for clarity) has an interpretation in terms of probability
in addition to its interpretation in terms of heterozygosity spelled out in
Equation 4.12. The probability interpretation is important in the calculation of
F from pedigrees. To express the inbreeding coefficient in terms of probability,
imagine the two alleles of a gene present in a single inbred organism. Because
the organism is inbred, the parents share one or more common ancestors. The
two alleles present in the inbred organism could have been derived from the
same ancestral allele by DNA replication in one of the common ancestors. In
this case, the alleles are said to be identical by descent (IBD), and the genotype
of the inbred organism is said to be autozygous. Conversely, the alleles may not
be replicas of a single ancestral allele, in which case the alleles are not identical
by descent, and the genotype is said to be allozygous. The probability
interpretation of the inbreeding coefficient is that F is the probability that the two
alleles of a gene in an inbred organism are IBD (autozygous). Note that the
concepts of autozygosity and allozygosity have nothing to do with the state of an
allele (whether the allele is A or a, for example). The concepts are concerned
only with common ancestry. If the alleles are replicas of a single allele in a
common ancestor, they are autozygous; otherwise, they are allozygous.

Interpreted as the probability of autozygosity, the inbreeding coefficient is
clearly a relative concept. F measures the probability of autozygosity relative
to some ancestral subpopulation. In defining the ancestral subpopulation, we
arbitrarily assume that all alleles present in the ancestral population are not
identical by descent. The inbreeding coefficient of an organism in the present
population is then the probability that the two alleles of a gene in the inbred
organism arose by replication of a single allele more recently than the time at
which the ancestral population existed. The ancestral population need not be
remote in time from the present one. Indeed, the ancestral population,
usually presumed to be noninbred ($F_{IS}$ = 0), typically refers to the population
existing just a few generations previous to the present one, and $F_{IS}$ in the
present population then measures inbreeding that has accumulated in the
span of these few generations. (Technically, any prior inbreeding is allocated
to $F_{ST}$.) Because the span of time is usually short, the possibility of mutation
can safely be ignored. Autozygous genotypes must therefore be homozygous
for some allele of the gene under consideration. On the other hand,
allozygous genotypes can be either homozygous or heterozygous.

Figure 4.7 In a genotype that is autozygous, homologous alleles are derived
from a single DNA sequence in an ancestor, and they are therefore identical by
descent. In an allozygous genotype, homologous alleles are not identical by
descent. As shown here, allozygous genotypes may be heterozygous or
homozygous, but autozygous genotypes must be homozygous (except in the
unlikely event that one allele has mutated).


Figure 4.7 illustrates how the concepts of autozygosity and allozygosity are
related to those of homozygosity and heterozygosity. The essential point is that
two alleles can be identical by state (IBS), which means that they have the
same sequence of nucleotides along the DNA, without being identical by
descent. The concept of identity by descent pertains to the ancestral origin of
an allele and not to its chemical makeup. Although, as shown in Figure 4.7, two
distinct alleles that are identical by state (for example, two $A_1$ alleles or two $A_2$
alleles) may come together in fertilization and thereby make the inbred
organism homozygous, the alleles in the ancestral population are, by definition, not
identical by descent, and so the genotype is allozygous. Similarly, although a
heterozygous genotype must be allozygous (ignoring mutation), a
homozygous genotype may be either autozygous or allozygous (see Figure 4.7).

The probability interpretation of the inbreeding coefficient results in the
same expected genotype frequencies as the heterozygosity interpretation set
out in Equation 4.10. To verify the equivalence, we need only consider the
implications of the probability definition for a subpopulation of inbred
organisms. For this purpose, imagine a subpopulation in which the
organisms have average inbreeding coefficient F. Consider the alleles of a gene
present in any one of the inbred organisms. Either of two things must be true:
the alleles must either be allozygous (probability 1 - F) or be autozygous
(probability F). If the alleles are allozygous, then the probability that the
chosen organism has any particular genotype is simply the probability of that
genotype in a random-mating population, because, by chance, the inbreeding
has not affected this particular gene. On the other hand, if the alleles are
autozygous, then the chosen organism must be homozygous, and the
probability of homozygosity for any particular allele is simply the frequency of
the allele in the subpopulation as a whole. (Because the alleles in question are
autozygous, knowing which allele is present in one chromosome
immediately tells you that an identical allele is in the homologous chromosome.) These
considerations hold regardless of the number of alleles but, to simplify
matters, suppose there are only two alleles A and a at frequencies p and q (with
p + q = 1). The probability that an organism has genotype AA is therefore
$p^2(1 - F) + pF$. In this expression, the first term refers to cases in which the
alleles are allozygous and the second to cases in which the alleles are
autozygous. Similarly, the probability that an organism has genotype aa is $q^2(1 - F)
+ qF$. Heterozygous Aa genotypes then have the frequency $2pq(1 - F)$ since
alleles that are heterozygous must be allozygous.
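
The two-step argument above (first decide whether the pair of alleles is autozygous, then assign the allelic state) can be mimicked by simulation. This is a rough sketch only: the values p = 0.3 and F = 0.4, the sample size, and the random seed are arbitrary choices, and the function names are our own.

```python
import random
from collections import Counter

def sample_genotype(p, F, rng):
    """Draw one genotype using the probability interpretation of F."""
    if rng.random() < F:
        # Autozygous: both alleles are copies of one ancestral allele,
        # so a single draw fixes the state of both.
        allele = "A" if rng.random() < p else "a"
        return allele + allele
    # Allozygous: the two alleles are independent draws (random mating).
    first = "A" if rng.random() < p else "a"
    second = "A" if rng.random() < p else "a"
    return "".join(sorted((first, second)))  # "Aa" regardless of order

def simulate(p, F, n, seed=1):
    rng = random.Random(seed)
    counts = Counter(sample_genotype(p, F, rng) for _ in range(n))
    return {g: counts[g] / n for g in ("AA", "Aa", "aa")}

p, F = 0.3, 0.4  # arbitrary illustrative values
q = 1 - p
simulated = simulate(p, F, n=100_000)
predicted = {
    "AA": p * p * (1 - F) + p * F,
    "Aa": 2 * p * q * (1 - F),
    "aa": q * q * (1 - F) + q * F,
}
print(simulated)
print(predicted)
```

The simulated frequencies should agree with the expressions of Equation 4.10 to within sampling error.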

The genotype frequencies with inbreeding are summarized graphically in
Figure 4.8. The box is divided vertically into two parts, corresponding to
genes whose alleles remain allozygous in spite of the inbreeding and those
whose alleles are autozygous because of the inbreeding. The division is in the
proportion 1 - F : F. Within the allozygous part of the box, the horizontal
panels correspond to the allozygous genotypes AA, Aa, and aa, which are the
Hardy-Weinberg frequencies. Within the autozygous part of the box, the
horizontal panels correspond to the autozygous genotypes AA and aa, which are
in the proportions p : q. The formulas for the genotype frequencies with
inbreeding are given in Table 4.3. Note that the genotype frequencies are
exactly the same as those given in Equation 4.10. This result shows that
the autozygosity definition of F and the heterozygosity definition of F,
though superficially quite different, are actually equivalent.

Corresponding to the probability interpretation of $F_{IS}$, there is also a
probability interpretation of $F_{ST}$. However, the comparison is not between
homologous alleles in the same organism but between homologous alleles drawn at
random from the same subpopulation. Specifically, $F_{ST}$ is the probability of
IBD between two alleles drawn at random from the same subpopulation.
However, the inbreeding at this level is not realized as a departure from HWE
but rather as differences in allele frequency among the subpopulations
(Equation 4.6). The variance in allele frequency, in turn, results in a departure
from HWE in the genotype frequencies when averaged across
subpopulations (Equation 4.7). The probability interpretation of $F_{ST}$ makes the meaning
of Equation 4.14 transparent. It says that, in the total population, a pair of
alleles will escape being IBD ($1 - F_{IT}$) only if they escape the effects of mating
between close relatives ($1 - F_{IS}$) and, independently, if they escape the
cumulative inbreeding effects of mating between remote relatives due to
population substructure ($1 - F_{ST}$).

Figure 4.8 Graphical representation of the effects of inbreeding on genotype
frequencies. Some genes remain allozygous in spite of the inbreeding, and
among these the genotype frequencies of AA, Aa, and aa are given by the
Hardy-Weinberg principle. Other genes are autozygous because of the
inbreeding, and among these the genotype frequencies of AA and aa are given by the
allele frequencies. There are no heterozygotes in the autozygous case because
the two alleles present at an autozygous locus are, by definition, identical by
descent.


TABLE 4.3 GENOTYPE FREQUENCIES WITH INBREEDING

Genetic  Effects  of  Inbreeding 

In outcrossing species, which means species that regularly avoid inbreeding,
close inbreeding is generally harmful. The effects are seen most
dramatically when inbreeding is complete or nearly complete. Although nearly
complete autozygosity can be approached in most species by many generations
of brother-sister mating, autozygosity of entire chromosomes can easily be
accomplished in Drosophila by the sort of mating scheme shown in Figure
4.9. In this diagram, Cy (Curly wings) and Pm (Plum-colored eyes) are
dominant mutations present in certain laboratory second chromosomes that carry
several long inversions to prevent recombination. In step A, a wildtype fly is
mated with Cy/Pm; four genotypes of offspring are produced because the
wildtype fly is heterozygous for two different wildtype chromosomes. From
each cross in A, a single Cy son is chosen and mated with Cy/Pm. This step
is shown in part B. Three classes of progeny are produced (because Cy/Cy
is lethal); moreover, from each mating the Cy/+ progeny all carry wildtype
second chromosomes that are IBD because they originated by replication of
a single chromosome in the previous generation. In the cross in part C, the
Cy/+ progeny from part B are mated among themselves; the expected
progeny are +/+ and Cy/+ in the ratio 1/3 : 2/3, and the wildtype
homozygotes have second chromosomes that are IBD. For chromosome 2, these flies
are completely inbred. In the mating D, Cy/+ flies carrying two different
wildtype chromosomes are crossed; again the expected progeny are +/+ and
Cy/+ in the ratio 1/3 : 2/3, but in this case the wildtype flies are heterozygous
for different copies of chromosome 2 and are not completely inbred.

For the matings in part C and part D, an estimate v of the viability
(ability to survive) of the +/+ genotype, relative to that of the Cy/+ genotype, is
given by

$$v = \frac{2 \times \text{Number}(+/+)}{1 + \text{Number}(Cy/+)}$$

where Number(+/+) and Number(Cy/+) are the counts of wildtype and
Curly offspring, respectively (Haldane 1956). The addition of 1 to the
denominator makes the estimate of v almost unbiased. When the total number of
offspring is large, v is essentially equal to two times the number of wildtype
offspring divided by the number of Curly offspring.
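
As a small illustration of the estimator, the sketch below applies it to invented offspring counts; the counts and the function name are hypothetical and are not taken from any experiment.

```python
def relative_viability(n_wildtype, n_curly):
    """Estimate the viability of +/+ relative to Cy/+ as
    2 * Number(+/+) / (1 + Number(Cy/+))."""
    return 2 * n_wildtype / (1 + n_curly)

# Under equal viability the expected ratio is 1/3 +/+ : 2/3 Cy/+,
# so 100 wildtype and 200 Curly offspring give v close to 1.
v_equal = relative_viability(100, 200)

# Invented counts with a shortage of wildtype flies
v_reduced = relative_viability(60, 200)

# A lethal chromosome yields no +/+ offspring at all
v_lethal = relative_viability(0, 200)

print(round(v_equal, 3), round(v_reduced, 3), v_lethal)
```

The factor of 2 rescales the expected 1 : 2 ratio of wildtype to Curly offspring so that equal viability corresponds to v near 1.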

Results of an experiment using the procedure in Figure 4.9 are shown in
Figure 4.10. It is evident that the homozygous genotypes (shaded histogram)
are relatively poor in viability. In fact, about 37% of the homozygotes are
lethal. Moreover, among the homozygotes that have viabilities within the
normal range of heterozygotes (open histogram), virtually all can be shown
to have reduced fertility (Sved 1975; Simmons and Crow 1977). Inbreeding so
close as to make entire chromosomes homozygous is rare in outcrossing
species, except in the kind of experiment in Figure 4.9, but the effects are
clearly very harmful and provide a new dimension of genetic diversity. In the
case of allozymes, genetic diversity results from common alleles that do not
perceptibly impair viability or fertility when homozygous. In the case of
inbreeding, the effects are mainly due to rare alleles that are severely
detrimental when homozygous. (The fact that the alleles are rare is shown by the
small proportion of lethal or near-lethal heterozygotes.) Figure 4.10 shows
that natural populations of Drosophila contain considerable hidden genetic
variation in the form of rare deleterious recessive alleles.

Detrimental  effects  of  inbreeding,  called  inbreeding  depression,  are 
found  in  virtually  all  outcrossing  species,  and  the  more  intense  the 
inbreeding,  the  more  harmful  the  effects.  Inbreeding  in  human  beings  is 
also  generally  harmful,  but  the  effect  is  difficult  to  measure  because  the 
degree  of  inbreeding  is  less  than  that  in  experimental  organisms;  the 
effects  may  also  vary  from  population  to  population.  Nevertheless,  chil- 
dren of  first-cousin  matings  are,  on  the  average,  less  capable  than  nonin- 
bred  children  in  any  number  of  ways  (for  example,  higher  rate  of 
mortality,  lower  IQ  scores) — although  it  should  be  emphasized  that  many 
such  children  are  within  the  normal  range  of  abilities  and  some  are  quite 
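gifted. As in most organisms, inbreeding depression is largely due to the
increased frequency of harmful conditions caused by rare recessive alleles
that have become homozygous.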


Figure 4.9 Mating scheme to extract wildtype chromosomes (in this case, the
second chromosome) from populations of Drosophila melanogaster. Cy (Curly
wings) and Pm (Plum eye color) are dominant mutations contained in certain
special laboratory chromosomes that have multiple inversions to prevent
recombination. From each mating of the type in part A, a single Cy son
(containing one wildtype second chromosome) is selected. This son is
backcrossed (part B) in order to reproduce many replicas of the second
chromosome; the Cy progeny are selected for further mating, and the other
progeny are discarded. Brother-sister mating as in part C is expected to
produce 1/4 Cy/Cy, 1/2 Cy/+, and 1/4 +/+ zygotes (where + denotes the wildtype
second chromosome); the Cy/Cy zygotes do not survive, and so the surviving
offspring are 2/3 Cy/+ (Curly wings) and 1/3 +/+ (wildtype straight wings).
Matings as in part D, between a female containing one wildtype second
chromosome and a male carrying a different one, are also expected to produce
2/3 Curly-winged and 1/3 straight-winged progeny. However, in mating C, the
straight-winged flies are homozygous for a single wildtype second chromosome,
whereas in mating D, the straight-winged flies are heterozygous for two
different wildtype second chromosomes.

(A) Mate and select single Curly-winged son.

(B) Backcross a single Cy male from (A) and select Curly sons and daughters,
which are heterozygous.

(C) Mate heterozygotes for same wildtype chromosome and count proportion
of non-Curly offspring. Expect 1/3 non-Cy.

(D) Mate heterozygotes for different chromosomes and count proportion
of non-Curly offspring.

Figure 4.10 Viability distributions of wildtype homozygotes (shaded area)
and wildtype heterozygotes (black outline) of second chromosomes extracted
from Drosophila melanogaster according to the mating scheme in Figure 4.9. The
histograms depict results of testing 691 homozygous combinations and 688
heterozygous combinations. Note that, in this sample, nearly 37% of the
wildtype chromosomes are lethal when homozygous, and many more have
viabilities substantially below normal. (Data from Mukai et al. 1974.)

The increased frequency of such conditions results from the genotype
frequencies given in Table 4.3. If a denotes a rare deleterious recessive
allele, then, among the children of first-cousin matings, the frequency of aa
is $q^2(1 - 1/16) + q(1/16)$ because, for these children, $F = 1/16$, as will
be shown in the next section. On the other hand, with random mating, the
frequency of recessive homozygotes is $q^2$. Thus, the risk of an affected
offspring from a first-cousin mating relative to that from a mating of
nonrelatives is given by


\[
\frac{q^2\left(1 - \tfrac{1}{16}\right) + q\left(\tfrac{1}{16}\right)}{q^2}
  = 0.9375 + \frac{0.0625}{q}
\tag{4.16}
\]


For example, when q = 0.01, the relative risk is approximately 7; that is,
a first-cousin mating has about seven times the chance of producing a
homozygous recessive child as a mating between nonrelatives when the
frequency of the harmful recessive allele is 0.01. There is clearly a dramatic
inbreeding effect, and the rarer the deleterious recessive allele, the
greater the effect.
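
The relative risk in Equation 4.16 is easy to tabulate for other allele
frequencies. The short Python sketch below is illustrative only: the function
name relative_risk and the particular values of q are assumptions introduced
here, and the formula is simply (1 - F) + F/q with F = 1/16 for the offspring
of first cousins.

```python
def relative_risk(q, F):
    """Risk of a homozygous recessive offspring when the inbreeding
    coefficient is F, relative to random mating: (1 - F) + F/q."""
    if not 0 < q <= 1:
        raise ValueError("allele frequency q must be in (0, 1]")
    return (1 - F) + F / q


# F = 1/16 for offspring of first cousins; the q values are illustrative.
for q in (0.1, 0.01, 0.001):
    print(f"q = {q}: relative risk = {relative_risk(q, 1 / 16):.2f}")
```

For q = 0.01 the function returns a value a little above 7, in line with the
example above, and the value grows as q becomes smaller.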


PROBLEM 4.8 Relative to the risk with random mating, calculate
the risk of a homozygous recessive offspring from a mating of second
cousins (F = 1/64) when the recessive allele frequency is q = 0.01.


ANSWER In general, the relative risk is given by $[q^2(1 - F) + qF]/q^2
= (1 - F) + F/q$. For F = 1/64, this becomes 0.9844 + 0.0156/q, and the
value for q = 0.01 is approximately 2.5.



Calculation  of  the  Inbreeding  Coefficient  from  Pedigrees 

Computation  of  F from  a pedigree  is  simplified  by  drawing  the  pedigree  in 
the  form  shown  in  Figure  4.11A,  where  the  lines  represent  gametes  con- 
tributed by  parents  to  their  offspring.  The  same  pedigree  is  shown  in  con- 
ventional form  in  Figure  4.11B.  The  organisms  in  gray  in  part  B are  not  rep- 
resented in  part  A because  they  have  no  ancestors  in  common  and  therefore 
do not contribute to the inbreeding of the organism denoted I. The
inbreeding coefficient $F_I$ of I is the probability that I is autozygous for
the alleles of


Figure 4.11 (A) Convenient way to represent pedigrees for calculation of the
inbreeding coefficient. In this case, the pedigree shows a mating between
half-first cousins. (B) Conventional representation of the same pedigree as in
part A. Squares represent males, circles represent females, and the shaded
organisms in part B are not depicted in part A because they do not contribute
to the inbreeding of the inbred organism designated I.

an autosomal gene under consideration. The first step in calculating $F_I$ is
to locate all the common ancestors in the pedigree, because an allele could
become autozygous in I only if it is inherited through both of I's parents
from a common ancestor. The second step is to trace all the paths of gametes
that lead from one parent of I back to a common ancestor and then down to the
other parent of I. These paths are the paths along which an allele in a
common ancestor could become autozygous in I. In Figure 4.11A, there is only
one such path: DBACE, in which the common ancestor, A, is underlined for
bookkeeping purposes, an especially useful procedure in complex pedigrees.


The next step is to calculate the probability of autozygosity due to each
path. For the path DBACE, the reasoning is illustrated in Figure 4.12. Here
the black dots represent alleles transmitted along the gametic paths, and the
number associated with each step is the probability of identity by descent of
the alleles indicated. For all steps except that around the common ancestor,
the probability is 1/2 because, with Mendelian segregation, the probability
that a particular allele present in a parent is transmitted to a specified
offspring is 1/2. To understand why $\tfrac{1}{2}(1 + F_A)$ is the probability
associated with the loop around the common ancestor, denote the alleles in
the common ancestor as $\alpha_1$ and $\alpha_2$. These symbols are used to
avoid confusion with conventional allele symbols designating functional types
of alleles, such as A for dominant and a for recessive. The pair of gametes
contributed by A could contain $\alpha_1\alpha_1$, $\alpha_2\alpha_2$,
$\alpha_1\alpha_2$, or $\alpha_2\alpha_1$, each with a probability of 1/4
because of Mendelian segregation. In the first two cases, the alleles are
clearly identical by descent; in the second two cases,


Figure 4.12 Loops for the pedigree in Figure 4.11A, showing probabilities that
designated alleles (solid dots) are identical by descent. Each loop is
independent of the others, so their probabilities multiply. Thus, the
inbreeding coefficient of organism I is $F_I = (1/2)^5(1 + F_A)$, where $F_A$
represents the inbreeding coefficient of the common ancestor.

the alleles are identical by descent only if $\alpha_1$ and $\alpha_2$ are
already identical by descent, which means that A is autozygous. The
probability that A is autozygous is, by definition, the inbreeding coefficient
of A, $F_A$. Hence, the probability for the step around the common ancestor A
is $\tfrac14 + \tfrac14 + \tfrac14 F_A + \tfrac14 F_A = \tfrac12 + \tfrac12 F_A = \tfrac12(1 + F_A)$.
Because each of the steps in Figure 4.12 is independent of the others, the
total probability of autozygosity in I due to the path through A is
$\tfrac12 \times \tfrac12 \times \tfrac12(1 + F_A) \times \tfrac12 \times \tfrac12$,
or $(1/2)^5(1 + F_A)$. Note that the exponent on the 1/2 is simply the total
number of ancestors in the path. In general, if a path through a common
ancestor A contains i individuals, the probability of autozygosity due to
that path is

\[
\left(\tfrac{1}{2}\right)^{i}(1 + F_A)
\]

Thus, the inbreeding coefficient of I in Figure 4.11A is $(1/2)^5(1 + F_A)$.
Assuming that A is not inbred ($F_A = 0$), the inbreeding coefficient of I
reduces to $(1/2)^5 = 1/32$.

In  pedigrees  of  greater  complexity,  there  is  more  than  one  common  ances- 
tor and  there  may  be  more  than  one  path  through  any  of  the  common  ances- 
tors. The  paths  are  mutually  exclusive  because  autozygosity  due  to  an  allele 
inherited  along  one  path  excludes  autozygosity  due  to  an  allele  inherited 
along  a different  path.  Thus,  the  total  inbreeding  coefficient  is  the  sum  of  the 
probabilities  of  autozygosity  due  to  each  path  considered  separately.  The 
whole  procedure  for  calculating  F is  summarized  in  an  example  of  a first- 
cousin  mating  in  Figure  4.13.  In  a first-cousin  mating,  there  are  two  common 


Pedigree paths:           GDACE                   GDBCE
Contribution to $F_I$:    $(1/2)^5(1 + F_A)$      $(1/2)^5(1 + F_B)$

Figure 4.13 On the left is a pedigree of individual I, the offspring of a
first-cousin mating. On the right are the two paths through common ancestors
(heavy lines) used in calculating the inbreeding coefficient of I. Below each
path is the contribution to $F_I$ due to that path, calculated as in Figure
4.12. Each path is mutually exclusive of the others, and so their
probabilities add. Thus, the total inbreeding coefficient of I is the sum of
the two separate contributions. If $F_A = F_B = 0$, then $F_I = 1/16$.

ancestors (A and B) and two paths (one each through A and B). The total
inbreeding coefficient of I is the sum of the two separate contributions shown
in Figure 4.13. If A and B are both noninbred, then $F_A = F_B = 0$, and so
$F_I = (1/2)^5 + (1/2)^5 = 1/16$; this result is the probability that I is
autozygous at the specified locus. Alternatively, $F_I$ can be interpreted as
the average proportion of all genes in I in which the alleles present are
autozygous.

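In general, for any autosomal gene, the formula for calculating the
inbreeding coefficient $F_I$ of an inbred organism I is

\[
F_I = \sum_{A} \left(\tfrac{1}{2}\right)^{i}(1 + F_A)
\tag{4.17}
\]

in which the summation $\sum$ over A means summation over all possible paths
through all common ancestors, and i is the number of organisms in each path.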
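
Once the paths have been listed, Equation 4.17 amounts to a sum with one term
per path. The Python sketch below is a minimal illustration under stated
assumptions: the function name inbreeding_from_paths and the representation
of each path as a pair (i, F_A) are choices made here, not notation from the
text. The pedigree itself must still be traced by hand to obtain the paths.

```python
from fractions import Fraction


def inbreeding_from_paths(paths):
    """Sum (1/2)^i * (1 + F_A) over all paths (Equation 4.17).

    `paths` is a list of (i, F_A) pairs: i is the number of organisms in
    the path and F_A is the inbreeding coefficient of its common ancestor.
    """
    half = Fraction(1, 2)
    return sum((half ** i) * (1 + Fraction(F_A)) for i, F_A in paths)


# Path lists written out by hand, with all common ancestors noninbred.
half_first_cousins = [(5, 0)]                      # one path (Figure 4.11A)
first_cousins = [(5, 0), (5, 0)]                   # two paths (Figure 4.13)
two_generations_sib_mating = [(3, 0), (3, 0),      # six paths (Problem 4.9)
                              (5, 0), (5, 0), (5, 0), (5, 0)]

print(inbreeding_from_paths(half_first_cousins))          # 1/32
print(inbreeding_from_paths(first_cousins))               # 1/16
print(inbreeding_from_paths(two_generations_sib_mating))  # 3/8
```

Exact fractions are used so that the results can be compared directly with
the values 1/32 and 1/16 obtained above.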


PROBLEM  4.9  The  accompanying  pedigree  depicts  two  generations 
of  brother-sister  mating.  Calculate  the  inbreeding  coefficient  of  I, 
assuming  that  none  of  the  common  ancestors  is  inbred.  (Altogether, 
there  are  four  common  ancestors  and  six  paths.) 


ANSWER $F_I = (1/2)^3(1 + F_C) + (1/2)^3(1 + F_D) + (1/2)^5(1 + F_A) +
(1/2)^5(1 + F_A) + (1/2)^5(1 + F_B) + (1/2)^5(1 + F_B)$. When the common
ancestors are assumed to be noninbred, then $F_A = F_B = F_C = F_D = 0$, and
so $F_I = 3/8$.

In plant and animal breeding, it is often important to know how rapidly the
inbreeding coefficient increases when a strain is propagated by a regular
system of mating, such as repeated self-fertilization, sib mating, or
backcrossing to a standard strain. The reasoning involved in calculating the
inbreeding coefficient for any generation is illustrated in Figure 4.14 for
repeated self-fertilization. In this figure, the labels t - 1 and t refer to
the inbred organisms after t - 1 and t generations of self-fertilization. The
loop around the ancestor in generation t - 1 designates the probability that
the two indicated alleles are identical by descent. Here the formula in
Equation 4.17 applies with only one path and only one ancestor in the path,
and so $F_t = (1/2)^1(1 + F_{t-1})$, where $F_t$ is the inbreeding coefficient
in generation t. This equation is easy to solve in terms of the quantity
$1 - F_t$, which is often called the panmictic index, panmixia being a synonym
for random mating. Multiplying both sides of the equation for $F_t$ by -1 and
then adding +1 to each side leads to
$1 - F_t = 1 - \tfrac12(1 + F_{t-1}) = 1 - \tfrac12 - \tfrac12 F_{t-1} = \tfrac12(1 - F_{t-1})$, or

\[
1 - F_t = \left(\tfrac{1}{2}\right)^{t}(1 - F_0)
\tag{4.18}
\]

where $F_0$ is the inbreeding coefficient in the initial generation when the
self-fertilization begins. Self-fertilization therefore leads to an extremely
rapid increase in the inbreeding coefficient. When $F_0 = 0$, then
$F_1 = 1/2$, $F_2 = 3/4$, $F_3 = 7/8$, $F_4 = 15/16$, and so on. The increase
in F under self-fertilization and several other regular systems of mating is
shown in Figure 4.15.
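
The recurrence for self-fertilization can be iterated directly and compared
with the closed form for the panmictic index. The following Python sketch is
illustrative; the function name selfing_F is an assumption introduced here,
and only the selfing case described above is covered.

```python
from fractions import Fraction


def selfing_F(generations, F0=Fraction(0)):
    """Iterate F_t = (1/2)(1 + F_{t-1}) and return [F_0, F_1, ..., F_t]."""
    values = [Fraction(F0)]
    for _ in range(generations):
        values.append(Fraction(1, 2) * (1 + values[-1]))
    return values


values = selfing_F(4)
print([str(F) for F in values])   # ['0', '1/2', '3/4', '7/8', '15/16']

# Cross-check against the closed form 1 - F_t = (1/2)^t (1 - F_0).
for t, F in enumerate(values):
    assert 1 - F == Fraction(1, 2) ** t * (1 - values[0])
```

Each generation of selfing halves the panmictic index, which is why F
approaches 1 so quickly.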


Many  plants  reproduce  predominantly  by  self-fertilization,  including 
crop  plants  such  as  soybeans,  sorghum,  barley,  and  wheat.  As  expected  of 


Figure 4.14 Increase in F resulting from continued self-fertilization. The
organism in generation t is the offspring of self-fertilization of the
organism in generation t - 1. The loop shows that
$F_t = \tfrac12(1 + F_{t-1})$.

Figure 4.15 Theoretical increase in the inbreeding coefficient F for regular
systems of mating: selfing, sib mating, half-sib mating, and repeated
backcrossing to a single organism from a random-bred strain. In each case, the
initial value of F is assumed to be $F_0 = 0$.


highly  self-fertilizing  species,  each  plant  is  highly  homozygous  for  alleles 
such  as  those  determining  allozymes.  Yet  the  proportion  of  polymorphic 
genes  is  comparable  to  that  found  in  outcrossing  species.  Polymorphisms  are 
found  because  self-fertilization  does  not  eliminate  genetic  variation;  it  simply 
reorganizes  genetic  variation  into  homozygous  genotypes.  On  the  other 
hand,  self-fertilizing  species  do  contain  fewer  deleterious  recessives  than  do 
outcrossing  species,  presumably  because  the  increased  homozygosity  per- 
mits harmful  recessives  to  be  eliminated  from  the  population  by  natural 
selection.  One  other  important  point  about  naturally  self-fertilizing  species: 
The  high  homozygosity  of  all  genes  implies  that  recombination  rarely  results 
in  new  gametic  types  not  already  present  in  the  parent.  Therefore,  predomi- 
nance of  selfing  has  the  effect  of  retarding  the  approach  to  linkage  equilibri- 
um because  the  approach  to  linkage  equilibrium  is  through  recombination 
in double heterozygotes (AB/ab and Ab/aB in the case of two alleles at each
locus); with extreme inbreeding, such double heterozygotes are rare. Indeed,
the most extreme examples of linkage disequilibrium have been found in
predominantly self-fertilizing species such as barley (Hordeum vulgare) and
wild oats (Avena barbata).

Barley,  which  regularly  undergoes  more  than  99%  self-fertilization,  pro- 
vides an  extreme  example  of  linkage  disequilibrium  between  two  unlinked 
esterase  genes  (Clegg  et  al.  1972).  A population  that  had  originated  as  a com- 
plex cross  was  maintained  for  26  generations  under  normal  agricultural  con- 
ditions without  conscious  selection.  The  population  was  polymorphic  for 
two alleles $B_1$ and $B_2$ of an Esterase-B gene and also polymorphic for two
alleles $D_1$ and $D_2$ of an Esterase-D gene. The gametic types were found in
the following proportions. For all practical purposes, these numbers also
refer to homozygous genotypes because there is such close inbreeding.


   Gametic type    Observed    Expected
   B1D1              1501      (1642.6)
   B1D2               754       (613.7)
   B2D1               720       (577.1)
   B2D2                74       (215.6)

(The numbers in parentheses are the expected numbers based on the assumption
of linkage equilibrium, calculated as in Chapter 3.) The $\chi^2$ value in
this case is 172.7 with one degree of freedom. The associated probability is
much less than 0.0001, and so there is undoubtedly linkage disequilibrium. For
the above data, the linkage disequilibrium parameter (Equation 3.9) is
D = -0.046, which is about 66% of its theoretical minimum.
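
The summary statistics quoted for the barley data can be recomputed from the
four observed counts. The Python sketch below is illustrative and rests on
stated assumptions: the function name ld_summary is introduced here, D is
taken as the frequency of the B1D1 gametic type minus the product of the B1
and D1 allele frequencies, and the theoretical minimum is taken as the most
negative D that the allele frequencies permit.

```python
def ld_summary(n11, n12, n21, n22):
    """Two-locus summary from counts of the four gametic types
    (n11 = B1D1, n12 = B1D2, n21 = B2D1, n22 = B2D2)."""
    n = n11 + n12 + n21 + n22
    p_b1 = (n11 + n12) / n            # frequency of allele B1
    p_d1 = (n11 + n21) / n            # frequency of allele D1
    d = n11 / n - p_b1 * p_d1         # disequilibrium parameter D
    # Most negative D permitted by the allele frequencies (assumed bound).
    d_min = -min(p_b1 * p_d1, (1 - p_b1) * (1 - p_d1))
    # Chi-square: sum of (observed - expected)^2 / expected over the 4 types.
    expected = [n * a * b for a in (p_b1, 1 - p_b1) for b in (p_d1, 1 - p_d1)]
    observed = [n11, n12, n21, n22]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return d, d_min, chi2


d, d_min, chi2 = ld_summary(1501, 754, 720, 74)
print(f"D = {d:.3f}")                     # D = -0.046
print(f"D / D_min = {d / d_min:.2f}")     # D / D_min = 0.66
print(f"chi-square = {chi2:.1f}")         # chi-square = 172.7
```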

One  of  the  dramatic  successes  of  plant  breeding  has  come  from  the  crossing 
of  inbred  lines  to  produce  high-yielding  hybrid  corn.  Yield  of  a genetically 
heterogeneous,  outcrossing  variety  of  corn  can  be  improved  by  selecting  the 
plants  with  the  highest  yields  in  each  generation  to  be  the  progenitors  of  the 
next  generation;  such  artificial  selection  results  in  only  gradual  improvement, 
however  (see  Chapter  9).  If  a large  number  of  self-fertilized  lines  are  estab- 
lished from  a heterogeneous  population,  each  line  declines  in  yield  as  inbreed- 
ing proceeds,  owing  to  the  forced  homozygosity  of  deleterious  recessives. 
Many  lines  become  so  inferior  that  they  have  to  be  discontinued.  Self-fertilized 
lines  are  not  likely  to  become  homozygous  for  exactly  the  same  set  of  deleteri- 
ous recessives,  however,  and  when  different  lines  are  crossed  to  produce  a 
hybrid,  the  hybrid  becomes  heterozygous  for  these  genes.  Alleles  favoring  high 
yield  in  corn  are  generally  dominant,  and  there  may  also  be  genes  in  which  the 
heterozygous  genotypes  have  a more  favorable  effect  on  yield  than  do  the 
homozygous  genotypes;  in  any  case,  the  hybrid  has  a much  higher  yield  than 
either  inbred  parent.  The  phenomenon  of  enhanced  hybrid  performance  is 
called  hybrid  vigor  or  heterosis.  In  practice,  inbred  lines  are  crossed  in  many 
combinations  to  identify  those  that  produce  the  best  hybrids.  Yields  of  hybrid 
corn  are  typically  15  to  35%  greater  than  yields  of  outcrossing  varieties,  and  the 
successful  introduction  of  hybrid  corn  has  been  remarkable.  Virtually  all  corn 
acreage  in  the  United  States  today  is  planted  with  hybrids,  as  compared  to 
0.4%  of  the  acreage  in  1933  (Sprague  1978). 

ASSORTATIVE  MATING 

When choice of mates is based on phenotypes, mating is said to be
assortative. Most assortative mating is positive assortative mating; this term
means that mating pairs have, on the average, more similar phenotypes than
expected with random mating. The qualifier "on the average" is important. Even
when  mating  is  random,  some  mating  pairs  are  phenotypically  similar,  and 
so  positive  assortative  mating  refers  only  to  those  situations  in  which  mating 
partners  are  phenotypically  more  similar  than  would  be  expected  by  chance 
encounters. 

There  are  also  examples  of  negative  assortative  mating — sometimes  called 
disassortative  mating — in  which  mating  pairs  are  more  dissimilar  than  expect- 
ed by  chance.  One  case  of  negative  assortative  mating  is  a polymorphism 
known as heterostyly found in most species of primroses (Primula) and their
relatives.  The  heterostyly  polymorphism  refers  to  the  relative  lengths  of  the 
styles  and  stamens  in  the  flowers  (Figure  4.16).  (In  botanical  terminology,  the 
style  is  a stalk  bearing  the  stigma,  which  is  the  female  organ  that  receives 
pollen;  the  stamen  is  the  male  organ  bearing  anthers,  in  which  the  pollen  is 
produced.)  Most  populations  of  primroses  contain  approximately  equal  pro- 
portions of  two  types  of  flowers,  one  known  as  pin,  which  has  a tall  style  and 
short  stamens,  and  the  other  known  as  thrum,  which  has  a short  style  and  tall 
stamens.  In  heterostyly,  insect  pollinators  that  work  high  on  the  flowers  pick 
up  mostly  thrum  pollen  and  deposit  it  on  pin  stigmas,  whereas  pollinators 
that  work  low  in  the  flowers  pick  up  mostly  pin  pollen  and  deposit  it  on 


Figure 4.16 Diagrams of cross sections of (A) pin and (B) thrum flowers of the
primrose, Primula. The pin flowers have a long style and short stamens; the
thrum flowers have a short style and long stamens. The differences in flower
morphology assist in the maintenance of negative assortative mating mediated
by insect pollinators.

thrum stigmas. Negative assortative mating therefore takes place because
pins  mate  preferentially  with  thrums.  Additional  floral  adaptations  facilitate 
the  negative  assortative  mating.  For  example,  pollen  grains  from  pin  flowers 
fit  the  receptor  cells  of  thrum  stigmas  better  than  they  do  their  own,  and 
pollen  grains  from  thrum  flowers  germinate  better  on  pin  stigmas  than  they 
do  on  their  own. 

The  pollination  biology  of  flowering  plants  also  provides  examples  of 
positive  assortative  mating.  For  example,  when  the  length  of  time  in  which 
any  plant  flowers  is  short  relative  to  the  total  duration  of  the  flowering  sea- 
son, then  plants  that  flower  early  in  the  season  are  preferentially  pollinated 
by  other  early  flowering  plants,  and  those  that  flower  late  are  preferentially 
pollinated  by  other  late  flowering  ones.  Thus,  there  is  positive  assortative 
mating  for  flowering  time. 

In  human  beings,  positive  assortative  mating  is  observed  for  height,  IQ 
score,  and  certain  other  traits,  although  assortative  mating  varies  in  degree  in 
different  populations  and  is  absent  in  some.  As  might  be  expected,  positive 
assortative  mating  is  found  for  certain  socioeconomic  variables.  In  one  study 
in  the  United  States,  the  highest  correlation  found  between  married  couples 
was  in  the  number  of  rooms  in  their  parents'  homes.  Negative  assortative 
mating  is  apparently  quite  rare  in  human  populations. 

In  certain  species  of  Drosophila,  a curious  type  of  nonrandom  mating  is  a 
phenomenon  called  minority  male  mating  advantage,  in  which  females  mate 
preferentially  with  males  with  rare  phenotypes.  For  example,  in  a study  of 
experimental  populations  of  D.  pseudoobscura  containing  flies  homozygous 
for  either  a recessive  orange  eye-color  mutation  or  a recessive  purple  eye-color 
mutation,  Ehrman  (1970)  found  that,  when  20%  of  the  males  were  orange,  the 
orange-eyed  males  participated  in  30%  of  the  observed  matings;  conversely, 
when  20%  of  the  males  were  purple,  the  purple-eyed  males  participated  in 
40%  of  the  observed  matings. 

The  consequences  of  positive  assortative  mating  are  complex.  They 
depend  on  the  number  of  genes  that  influence  the  trait  in  question,  on  the 
number  of  different  possible  alleles  of  the  genes,  on  the  number  of  different 
phenotypes,  on  the  sex  performing  the  mate  selection,  and  on  the  criteria  for 
mate  selection.  Traits  for  which  mating  is  assortative  are  rarely  determined 
by  the  alleles  of  a single  gene,  however.  Most  such  traits  are  polygenic,  so  rea- 
sonably realistic  models  of  assortative  mating  tend  to  be  rather  complex. 
Here  we  should  note  one  obvious,  qualitative  consequence  of  positive  assor- 
tative mating:  since  like  phenotypes  tend  to  mate,  assortative  mating  gener- 
ally increases  the  frequency  of  homozygous  genotypes  in  the  population  at 
the  expense  of  heterozygous  genotypes,  and  thus  the  phenotypic  variance  in 
the  population  increases.  (Negative  assortative  mating  generally  has  the 
opposite  effect.) 

SUMMARY

Species  that  are  spread  over  a large  geographical  area  are  usually  divided 
into  subpopulations.  Matings  between  organisms  within  the  same  subpopu- 
lation are  more  likely  than  matings  between  organisms  in  different  subpop- 
ulations. Geographical  subdivision  of  a population  is  called  population  sub- 
structure. The  genetic  consequences  of  population  substructure  result  from 
the  fact  that  the  frequencies  of  alleles  may  differ  from  one  subpopulation  to 
the  next.  When  the  allele  frequencies  differ,  the  average  heterozygosity 
among  the  subpopulations  is  smaller  than  that  expected  with  random  mat- 
ing in  the  total  population.  Many  populations  are  subdivided  into  groups 
within  larger  groups,  a kind  of  structure  called  a hierarchical  population 
structure. The F statistics are a quantitative measure of the reduction in
heterozygosity at various levels in a population hierarchy. For example,
$F_{SR}$ is the proportionate reduction in average heterozygosity among
subpopulations (S) as compared to that expected with HWE within regions (R):
$F_{SR} = (H_R - H_S)/H_R$. Similarly, $F_{RT}$ is the proportionate reduction
in average heterozygosity among regions (R) as compared to that expected with
HWE in the total population (T): $F_{RT} = (H_T - H_R)/H_T$. The fixation index
$F_{ST}$ combines the effects due to subdivision into subpopulations within
regions and regions within the total population: $F_{ST} = (H_T - H_S)/H_T$.
Generally speaking,
an  F statistic  with  a value  smaller  than  0.05  indicates  little  genetic  differenti- 
ation; a value  from  0.05  to  0.15  indicates  moderate  genetic  differentiation, 
from  0.15  to  0.25  indicates  great  genetic  differentiation,  and  above  0.25  indi- 
cates very  great  genetic  differentiation  among  subpopulations. 
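
The three F statistics follow directly from the heterozygosities at each
level of the hierarchy. The Python sketch below is illustrative only: the
function names and the heterozygosity values are hypothetical, and the
handling of values that fall exactly on a class boundary (0.05, 0.15, 0.25)
is an assumption, since the verbal scale above does not specify it.

```python
def f_statistics(h_s, h_r, h_t):
    """Return (F_SR, F_RT, F_ST) from the average heterozygosities expected
    within subpopulations (h_s), regions (h_r), and the total (h_t)."""
    f_sr = (h_r - h_s) / h_r
    f_rt = (h_t - h_r) / h_t
    f_st = (h_t - h_s) / h_t
    return f_sr, f_rt, f_st


def differentiation(f):
    """Verbal scale for an F statistic; boundary values are assigned to the
    higher class by assumption."""
    if f < 0.05:
        return "little"
    if f < 0.15:
        return "moderate"
    if f < 0.25:
        return "great"
    return "very great"


# Hypothetical heterozygosities, chosen only for illustration.
f_sr, f_rt, f_st = f_statistics(h_s=0.32, h_r=0.36, h_t=0.40)
for name, value in (("F_SR", f_sr), ("F_RT", f_rt), ("F_ST", f_st)):
    print(f"{name} = {value:.3f} ({differentiation(value)} differentiation)")

# The levels combine multiplicatively: (1 - F_ST) = (1 - F_SR)(1 - F_RT).
assert abs((1 - f_st) - (1 - f_sr) * (1 - f_rt)) < 1e-12
```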

When  subpopulations  undergo  fusion  and  random  mating,  the  deficien- 
cy of  heterozygotes  is  eliminated.  Said  another  way  around,  the  excess  of 
homozygous  genotypes  in  a subdivided  population  is  eliminated  by  popu- 
lation fusion  and  random  mating.  This  effect  of  population  fusion  is  called 
the  Wahlund  principle.  Quantitatively,  the  Wahlund  principle  implies  that 
population  fusion  and  random  mating  will  cause  a reduction  in  the  frequen- 
cy of  any  homozygous  genotype  by  an  amount  equal  to  the  variance  in  allele 
frequency among the original subpopulations. For two alleles, the Wahlund
effect is related to the fixation index by the relation
$F_{ST} = \sigma^2/(\bar{p}\,\bar{q})$. In terms of the fixation index, the
average genotype frequencies across subpopulations are: AA with average
frequency $\bar{p}^2(1 - F_{ST}) + \bar{p}F_{ST}$, Aa with average frequency
$2\bar{p}\bar{q}(1 - F_{ST})$, and aa with average frequency
$\bar{q}^2(1 - F_{ST}) + \bar{q}F_{ST}$.
Despite  the  departure  from  HWE  when  genotype  frequencies  are  averaged 
across  subpopulations,  within  each  subpopulation  mating  is  random  and 
the  genotype  frequencies  are  in  HWE  for  the  allele  frequencies  in  the  sub- 
population. 
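
The Wahlund principle can be checked numerically for any set of
subpopulations. The Python sketch below is a minimal illustration under
stated assumptions: the subpopulations are taken to be of equal size and in
HWE, the function name wahlund is introduced here, and the allele
frequencies 0.2 and 0.6 are hypothetical.

```python
def wahlund(qs):
    """For recessive-allele frequencies `qs` in equally sized subpopulations,
    return (average aa frequency before fusion, aa frequency after fusion
    and random mating, variance in allele frequency, F_ST)."""
    k = len(qs)
    q_bar = sum(qs) / k
    variance = sum((q - q_bar) ** 2 for q in qs) / k
    aa_before = sum(q * q for q in qs) / k     # average of q^2 (HWE in each)
    aa_after = q_bar ** 2                      # HWE in the fused population
    f_st = variance / (q_bar * (1 - q_bar))
    return aa_before, aa_after, variance, f_st


aa_before, aa_after, variance, f_st = wahlund([0.2, 0.6])
print(f"aa before fusion: {aa_before:.3f}, after fusion: {aa_after:.3f}")
print(f"variance: {variance:.3f}, F_ST: {f_st:.3f}")

# The reduction in homozygote frequency equals the variance in allele frequency.
assert abs((aa_before - aa_after) - variance) < 1e-12
# The average aa frequency also equals q_bar^2 (1 - F_ST) + q_bar F_ST.
q_bar = 0.4
assert abs(aa_before - (q_bar ** 2 * (1 - f_st) + q_bar * f_st)) < 1e-12
```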

Inbreeding  means  mating  between  relatives.  The  most  important  effect 
of  inbreeding  is  that  replicas  of  a single  allele  in  a common  ancestor  may  be 
transmitted down both sides of the pedigree and come together in
fertilization to produce the inbred organism. In such a case, the inbred
organism is said to be autozygous, and the alleles are identical by descent
(IBD). Otherwise the inbred organism is allozygous. The inbreeding coefficient
F is the probability that the two homologous genes in an inbred organism are
IBD. With close inbreeding among parents with relatively recent common
ancestors, the value of F can be calculated from elementary probability
considerations using the formula $F = \sum (1/2)^i(1 + F_A)$, where the
summation is over all paths from one parent to the other through each common
ancestor, i is the number of organisms in the path, and $F_A$ is the
inbreeding coefficient of the common ancestor in the path. Among organisms in
which the inbreeding coefficient is F, the genotype frequencies of a gene with
two alleles are, for AA, $p^2(1 - F) + pF$; for Aa, $2pq(1 - F)$; and for aa,
$q^2(1 - F) + qF$. Hence, one of the most important consequences of close
inbreeding is an increased risk of homozygosity of rare recessive alleles:
$q^2(1 - F) + qF$ for inbred organisms versus $q^2$ for noninbred organisms.
In human populations, a substantial proportion of children affected with
rare, homozygous recessive genetic diseases have first-cousin parents,
although first-cousin mating is infrequent.

Population substructure results in an accumulation of inbreeding because
mating pairs within subpopulations will often have remote relatives in
common, even when mates are chosen at random. Thus, the inbreeding coefficient
F resulting from nonrandom mating within a subpopulation should be designated
$F_{IS}$. The total inbreeding resulting from nonrandom mating combined with
all levels of population substructure is given by the expression
$(1 - F_{IT}) = (1 - F_{IS}) \times (1 - F_{ST})$.


PROBLEMS 

1.  Two  diploid  random  mating  populations  have  allele  frequencies  q + e and 
q - e for  a recessive  allele  of  a gene.  What  are  the  frequencies  of  homozy- 
gous recessives  before  and  after  population  fusion? 

2. Show that $F_{IT} = F_{IS} + F_{ST} - F_{IS}F_{ST}$ and interpret the
expression.

3. Calculate $F_{ST}$ among the three random-mating populations below based
on the specified allele frequencies. What is the maximum value of $F_{ST}$ in
this situation?

               Population 1    Population 2    Population 3
   Allele 1        0.1             0.2             0.3
   Allele 2        0.3             0.3             0.3
   Allele 3        0.6             0.5             0.4

4. Calculate $F_{IS}$, $F_{ST}$, and $F_{IT}$ for the populations with the
genotype frequencies shown in the following table:

   Genotype    Population 1    Population 2
   AA             0.056           0.072
   Aa             0.288           0.256
   aa             0.656           0.672


5.  Suppose  two  subpopulations  with  equal  allele  frequencies  of  two  linked 
genes  have  an  amount  of  linkage  disequilibrium  that  is  equal  but  opposite 
in  sign.  What  is  the  amount  of  linkage  disequilibrium  in  a population 
formed  by  mixing  equal  numbers  of  individuals  from  the  two  populations? 

6. Show that $p^2(1 - F) + pF = p^2 + pqF = p - (1 - F)pq$, when $q = 1 - p$.

7. With two alleles and $p = 1/2$, what are the expected genotype frequencies
in a random mating population and among the offspring of first cousins?
How great is the decrease in heterozygosity in the inbred population
relative to the random mating population?

8.  If  the  frequency  of  an  autosomal  recessive  disorder  is  1/1600  among 
unrelated  parents,  what  is  the  expected  frequency  among  the  offspring  of 
first  cousins? 

9. For a recessive allele at frequency q in a population in which one percent
of the matings are between first cousins, but otherwise occur at random,
the proportion of affected individuals having first-cousin parents is
$(1 + 15q)/(1 + 1599q)$. Calculate for q = 0.1, 0.05, 0.01, 0.005, and 0.001.
Interpret the result of the equation when q = 1.

10. In a population of monoecious plants in Hardy-Weinberg proportions for
two alleles with allele frequency p, what is the variance in allele
frequency among plants? What is the variance if the population were completely
inbred? If a random mating population were to undergo self-fertilization,
what would the variance be when the inbreeding coefficient equals F?

11. The measure of genetic divergence $G_{ST}$ is very useful for multiple alleles
in multiple subpopulations. $G_{ST}$ can be defined as $(J_S - J_T)/(1 - J_T)$, where
$p_i$ is the frequency of the ith allele, $J_S = \sum \mathrm{Avg}(p_i^2)$, and
$J_T = \sum [\mathrm{Avg}(p_i)]^2$ (Nei 1987). The summation means summation over
all alleles, and Avg means the average over all subpopulations. For the random
mating populations below, calculate $F_{ST}$ and $G_{ST}$.

            Population 1   Population 2
Allele 1        0.2            0.6
Allele 2        0.3            0.0
Allele 3        0.5            0.4


12. $G_{ST}$ for multiple alleles is actually a weighted average of $F_{ST}$ values,
$G_{ST} = \sum \bar{p}_i(1 - \bar{p}_i)F_{ST(i)} / \sum \bar{p}_i(1 - \bar{p}_i)$, where the
summation is over all alleles, $\bar{p}_i$ is the average frequency of the ith allele
among the subpopulations, and $F_{ST(i)}$ is the $F_{ST}$ value for the ith allele
calculated as if the gene had only two alleles with frequencies $p_i$ and
$1 - p_i$ in each subpopulation. Calculate $F_{ST(i)}$ for each allele in the
preceding problem and confirm numerically that the weighted average equals $G_{ST}$.

13.  In  calculating  F from  pedigrees  for  X-linked  genes,  why  are  paths  with 
two  or  more  consecutive  males  not  counted? 

14. What is the coefficient of relationship between I and J in the
accompanying pedigree, where I and J are the offspring of a pair of first cousins
(A, B) mated with another pair of first cousins (C, D)?


15. Assuming $F_A = F_B = 0$, calculate the inbreeding coefficient for each of the
individuals C through I in the accompanying pedigree.

[Pedigree diagram: individuals A through I.]

16. If a population is maintained by self-fertilization in even-numbered
generations and by random mating in odd-numbered generations, what
happens to the inbreeding coefficient?

17. For a gene with two alleles and p = 0.3, what are the expected genotype
frequencies after five generations of sib mating? What are the expected
genotype frequencies after one additional generation of random mating?

18. What is the inbreeding coefficient in a population of size 50 that
undergoes

a. 47 generations of random mating followed by three generations of sib
mating?

b. 50 generations of random mating?

19. In gametophytic self-incompatible plants, the pollen can only fertilize
ovules whose genotype has neither allele borne by the haploid pollen. In
a plant population at equilibrium with three gametophytic
self-incompatibility alleles, what is the probability that a pollen grain will
land on a compatible style?

20. Two-way hybrid corn is produced by crossing two different inbred lines;
three-way hybrids are produced by crossing a two-way hybrid with an
unrelated inbred; and four-way hybrids are produced by crossing two
different two-way hybrids. What is the inbreeding coefficient of the
offspring of randomly mated two-way, three-way, or four-way hybrids?
(Hint: Consider the allele frequencies in gametes.)

21. Derive a recursion equation for $F_t$ for repeated parent-offspring mating
(see pedigree), and calculate $F_t$ for t = 0 to 5.


22. Derive a recursion equation for $F_t$ for repeated backcrossing to a single
noninbred individual A (see pedigree). Calculate $F_t$ for t = 0 to 5 and the
equilibrium value.


CHAPTER  5 


Sources  of  Variation 


Mutation  Infinite  Alleles  Model  Neutral  Mutations 
Recombination  Migration  Transposable  Elements 


Genetics includes several processes that create new types of genetic
variation in populations or that allow for the reorganization of
previously existing variation either within genomes or among
subpopulations. The ultimate source of genetic variation is mutation, by
which we mean any heritable change in the genetic material. Mutation
therefore includes a change in the nucleotide sequence of a single gene as well
as the formation of a chromosome rearrangement, such as an inversion or a
translocation. Recombination brings mutations of different genes together into
the same chromosome. Migration enables mutations to spread among
subpopulations. A transposable element is a DNA sequence able to replicate and
insert into any of a large number of sites in the genome. By insertion in or
near a gene, a transposable element can alter the level or pattern of gene
expression; recombination between transposable elements can result in a
chromosome rearrangement, for example, an inversion. In this chapter, we
consider the processes by which genetic variation is created.


MUTATION 

Mutation is the ultimate source of genetic variation for evolutionary change.
However, most wildtype genes mutate at a very low rate, typically in the
range from $10^{-4}$ to $10^{-6}$ new mutations per gene per generation. Even a low
mutation rate can create many new mutant alleles because, in a large
population, each of a large number of genes is at risk of mutating. In a population
of size N diploid organisms, there are 2N copies of each gene, each of which
can mutate in any generation. Mutations are rare, but in a large population
there are many alleles at risk. For example, if the mutation rate (probability
of mutation) is $10^{-9}$ per nucleotide pair per generation, then in each human
gamete, the DNA of which contains $3 \times 10^9$ nucleotide pairs, there would be an
average of three new mutations in each generation; each newly fertilized egg
would carry, on the average, six new mutations. The present-day human
population of approximately 6 billion people would therefore be expected to
carry approximately 36 billion new mutations that were not present even one
generation earlier.
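The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch that assumes a mutation rate of $10^{-9}$ per nucleotide pair and a haploid genome of $3 \times 10^9$ nucleotide pairs, the values consistent with the three new mutations per gamete stated above; the function name is hypothetical.

```python
def expected_new_mutations(rate_per_bp: float, genome_bp: float,
                           population_size: float):
    """Return expected new mutations per gamete, per zygote, and in the
    whole population in one generation."""
    per_gamete = rate_per_bp * genome_bp   # one haploid genome
    per_zygote = 2 * per_gamete            # two gametes form one fertilized egg
    total = per_zygote * population_size
    return per_gamete, per_zygote, total


# Assumed inputs: 1e-9 per nucleotide pair, 3e9 nucleotide pairs, 6e9 people.
per_gamete, per_zygote, total = expected_new_mutations(1e-9, 3e9, 6e9)
print(per_gamete, per_zygote, total)
```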

Irreversible  Mutation 

Although mutation may create a new allele, the initial frequency of the
mutant allele must be very small if the population size is large. A single new
mutant allele in a diploid population of size N has an initial frequency of
1/2N. New mutations in subsequent generations may augment the number
of mutant alleles, but recurrent mutation alone increases the allele
frequency of the mutant very slowly. Consider an example in which A is the
wildtype allele and a the mutant form. If there is exactly one new mutation per
generation, then the allele frequency of a increases according to the series
1/2N, 2/2N, 3/2N, . . . and, if N is large (for example, $N = 10^6$), then the
increase is very slow indeed. Hence, the tendency for allele frequency to
change as a result of recurrent mutation (mutation pressure) is very small.
On the other hand, the cumulative effects of mutation over long periods of
time can become appreciable.

A useful model for thinking about mutation is the Hardy-Weinberg model
of Chapter 3, but with mutation permitted. For the moment, we focus on
mutations that have so little effect on the ability of the organism to survive and
reproduce that natural selection does not appreciably influence their
frequency. We will also assume that mutation is irreversible, which means that a
cannot reverse-mutate to A. To avoid complications resulting from change in
allele frequency due to chance, we will assume a population that is infinite in
size.

Consider a gene with two alleles, A and a, and suppose that A mutates to
a at a rate of $\mu$ mutations per A allele per generation. In other words, each A
allele has a probability of $\mu$ of mutating to a in any generation. We will
symbolize the allele frequency of A as p and that of a as q and keep track of
generations with subscripts. Hence, $p_t$ and $q_t$ are the allele frequencies of A
and a, respectively, in the tth generation, where t = 0, 1, 2, .... In any
generation, $p_t + q_t = 1$ because A and a are the only alleles considered.

Next we will deduce a formula for the allele frequency $p_t$ in terms of the
allele frequency in the previous generation. In generation t, $p_t$ includes all
the A alleles in generation t - 1 that did not mutate in that generation, and so

$$p_t = p_{t-1} \times (1 - \mu)$$

However, by the same reasoning, $p_{t-1}$ includes all A alleles in generation
t - 2 that did not mutate in that generation, and so
$p_{t-1} = p_{t-2} \times (1 - \mu)$. Substituting this equation into the one above yields

$$p_t = p_{t-2} \times (1 - \mu)^2$$

Continuing in the same manner leads eventually to

$$p_t = p_0(1 - \mu)^t \tag{5.1}$$

The effect of mutation pressure on allele frequency is illustrated in Figure
5.1 for the case $\mu = 10^{-4}$. The allele frequency of A decreases very slowly,
almost linearly at first because the governing term in Equation 5.1, $(1 - \mu)^t$, is
approximated by $1 - \mu t$ when t is sufficiently small. After 1000 generations,
the allele frequency of A is still 0.90; however, at t = 10,000 generations,
$p_t = 0.37$; and at t = 20,000 generations, $p_t = 0.14$.
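The values quoted for Figure 5.1 follow directly from Equation 5.1 with $p_0 = 1$ and a mutation rate of $10^{-4}$. A minimal sketch (the function name is hypothetical) that also prints the linear approximation for comparison:

```python
def allele_freq_irreversible(p0: float, mu: float, t: int) -> float:
    """Equation 5.1: frequency of A after t generations of irreversible
    mutation at rate mu, starting from frequency p0."""
    return p0 * (1.0 - mu) ** t


mu = 1e-4
for t in (1000, 10_000, 20_000):
    exact = allele_freq_irreversible(1.0, mu, t)
    linear = 1.0 - mu * t  # accurate only while mu * t is small
    print(t, round(exact, 2), round(linear, 2))
```

The linear approximation is close at t = 1000 but fails badly by t = 10,000, where it predicts a frequency of 0.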

One instructive way to analyze Equation 5.1 is to consider the time
required to reduce the allele frequency of A by half. To find the "half-life" of
the process, set $p_t = 0.5 \times p_0$; this relationship implies that
$0.5 = (1 - \mu)^t$. Taking logarithms of both sides, we obtain

$$t_{1/2} = \ln(0.5)/\ln(1 - \mu) \approx 0.6931/\mu$$

In the example in Figure 5.1, $t_{1/2} = 6931$ generations. A decrease in $\mu$ by a
factor of 10 increases $t_{1/2}$ accordingly, to approximately 69,310 generations
for $\mu = 10^{-5}$ and to approximately 693,100 generations for $\mu = 10^{-6}$. The fact
that mutation pressure is a weak force for changing allele frequency is
illustrated by the long half-lives calculated for realistic values of the mutation
rate.

Figure 5.1 Change in frequency under mutation pressure. In this example, an
allele A mutates to a at a rate of $\mu = 1 \times 10^{-4}$ per generation; $p_t$ is the allele
frequency of A in generation t. We assume that $p_0 = 1$. With the given value of
$\mu$, the allele frequency decreases by half every 6931 generations.
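The half-life calculation can be sketched as follows. The function name is hypothetical; `math.log1p(-mu)` is used in place of `math.log(1 - mu)` only because it is numerically more accurate when the mutation rate is tiny.

```python
import math


def half_life(mu: float) -> float:
    """Generations needed for irreversible mutation at rate mu to halve
    the frequency of A: ln(0.5) / ln(1 - mu)."""
    return math.log(0.5) / math.log1p(-mu)


for mu in (1e-4, 1e-5, 1e-6):
    # Exact value next to the approximation 0.6931 / mu used in the text.
    print(mu, round(half_life(mu)), round(0.6931 / mu))
```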

As noted with reference to Equation 5.1, the approximation $p_t \approx p_0(1 - \mu t)$
is quite accurate for small values of t. With respect to the allele frequency of
the mutant allele a, the approximation can also be written as $q_t \approx q_0 + \mu t$,
provided that $q_0$ is small. This approximation implies that the allele
frequency of the a allele increases linearly with time with a slope equal to $\mu$.
Because $\mu$ is small, however, the linear increase in $q_t$ is difficult to detect
experimentally except in very large populations. A large population size
can be attained in a bacterial chemostat, which is a device for maintaining
a population of bacteria in a continuous state of growth and cell division
(Figure 5.2). The linear increase in $q_t$ from mutation pressure observed in a
chemostat is shown in Figure 5.3. Note the abrupt increase in mutation rate
(indicated by the increase in slope) shortly after the addition of caffeine, a
bacterial mutagen.

Figure 5.2 Diagram of a bacterial chemostat. Nutrient medium drips in at the
top, but a constant volume is maintained by means of an overflow siphon. The
air coming in at the bottom provides oxygen. At the steady state, the rate of
inflow of nutrient equals the rate of outflow. Cells within the chemostat are in a
continuous state of division, but the population does not increase in size
because, in any interval of time, the number of new cells produced by division
is balanced by the number washed out through the siphon.


[Graph: horizontal axis is time (t, in generations).]

Figure 5.3 Estimation of mutation rate in a bacterial chemostat. This
example concerns the rate of mutation of a gene in Escherichia coli that confers
resistance to infection by the bacteriophage T5. The frequency $q_t$ is the
frequency of T5-resistant cells after t generations of growth. The mutation rate is
estimated as the slope of the straight-line segments. Prior to the addition of
caffeine, the slope was $\mu = 7.2 \times 10^{-8}$ per generation. After addition of
caffeine at a concentration of 150 mg/l, the slope increased about tenfold to
$\mu = 66 \times 10^{-8}$ per generation. In this experiment, the generation time was
5.5 hours. (From Novick 1955.)
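Estimating a mutation rate as the slope of a straight-line segment can be illustrated with an ordinary least-squares fit. The sketch below uses synthetic data generated from $q_t = q_0 + \mu t$ with the pre-caffeine slope quoted in the caption; the data points, the starting frequency, and the function name are hypothetical and are not Novick's measurements.

```python
def least_squares_slope(times, freqs):
    """Ordinary least-squares slope of freqs regressed on times."""
    n = len(times)
    t_mean = sum(times) / n
    q_mean = sum(freqs) / n
    numerator = sum((t - t_mean) * (q - q_mean) for t, q in zip(times, freqs))
    denominator = sum((t - t_mean) ** 2 for t in times)
    return numerator / denominator


# Synthetic, noise-free data: q_t = q0 + mu * t (assumed values).
mu_true = 7.2e-8
q0 = 1.0e-6
times = list(range(0, 101, 10))
freqs = [q0 + mu_true * t for t in times]

print(least_squares_slope(times, freqs))  # recovers a slope close to mu_true
```

With real measurements the points scatter around the line, and the fitted slope is an estimate of the mutation rate rather than an exact recovery.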


PROBLEM 5.1 A genetic factor has been described in Drosophila
mauritiana that results in the spontaneous deletion of the
transposable genetic element mariner at a frequency of approximately one
percent per generation for each copy (Bryan et al. 1987). In a
population containing an autosomal site at which a mariner insertion is
fixed (homozygous), how many generations would be required for
the frequency of flies that are homozygous for a deletion of the
element to exceed five percent? Assume that the population is
large, that mating is random, that the excision factor is fixed, and
that deletion of the element does not affect survival or
reproduction.

ANSWER Let $p_t$ be the frequency of chromosomes in which the
mariner element remains undeleted in generation t, and let $\mu = 0.01$ be
the probability of deletion of the element per generation. For this
situation, Equation 5.1 applies with $\mu = 0.01$ and $p_0 = 1$. The frequency of
deletion homozygotes is greater than five percent when $(1 - p_t)^2 > 0.05$,
or $p_t < 1 - (0.05)^{1/2} = 0.776$. Thus, t should be greater than ln(0.776)/
ln(0.99) = 25.2 generations.
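The answer can be verified numerically. In this minimal sketch the function names are hypothetical, and the calculation assumes, as the problem does, random mating and Equation 5.1 with $p_0 = 1$.

```python
import math


def generations_to_exceed(threshold: float, mu: float) -> float:
    """Generations until deletion homozygotes, (1 - p_t)^2, exceed threshold,
    where p_t = (1 - mu)^t is the frequency of undeleted chromosomes."""
    p_needed = 1.0 - math.sqrt(threshold)
    return math.log(p_needed) / math.log(1.0 - mu)


def deletion_homozygotes(mu: float, t: int) -> float:
    """Frequency of deletion homozygotes after t generations."""
    return (1.0 - (1.0 - mu) ** t) ** 2


print(round(generations_to_exceed(0.05, 0.01), 1))
# Whole generations on either side of the threshold.
print(deletion_homozygotes(0.01, 25), deletion_homozygotes(0.01, 26))
```

Because generations are whole numbers, the five percent level is first exceeded in generation 26.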


Reversible  Mutation 

In  this  section,  in  addition  to  forward  mutation  of  A to  a,  we  also  allow 
reverse  mutation  from  a to  A.  In  this  case,  the  mutation  pressure  on  the  allele 
frequency  p is  in  both  directions:  forward  mutation  tends  to  decrease  p, 
reverse  mutation  tends  to  increase  p.  Eventually,  an  equilibrium  is  reached  in 
which  the  frequency  p remains  constant  from  generation  to  generation.  At 
this  point,  the  loss  of  A alleles  from  forward  mutation  is  exactly  offset  by  the 
gain  of  A alleles  from  reverse  mutation. 

To deduce the point of equilibrium, suppose that the rate of forward
mutation from A to a is $\mu$ per generation and that the rate of reverse mutation
from a to A is $\nu$ per generation. Let $p_t$ and $q_t$ denote the allele frequencies of A
and a in generation t, so that $p_t + q_t = 1$. An A allele in generation t can
originate in either of two ways. It could have been an A allele in generation t - 1
that escaped mutation to a (which happens with probability $1 - \mu$), or it could
have been an a allele in generation t - 1 that mutated to A (which happens
with probability $\nu$). In symbols,

$$p_t = p_{t-1}(1 - \mu) + (1 - p_{t-1})\nu \tag{5.2}$$

To solve equations of this type, a useful trick is to determine whether the
relation can be expressed in the form $p_t - A = (p_{t-1} - A)B$, where A and B are
constants dependent only on $\mu$ and $\nu$. Simplifying, we obtain
$p_t = p_{t-1}B + A(1 - B)$. Putting Equation 5.2 into the same form yields
$p_t = p_{t-1}(1 - \mu - \nu) + \nu$. Equating like terms, we deduce that
$B = 1 - \mu - \nu$ and $A(1 - B) = \nu$. Consequently, $A = \nu/(\mu + \nu)$. Hence, we can
rewrite Equation 5.2 in the form

$$p_t - \frac{\nu}{\mu + \nu} = \left(p_{t-1} - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu) \tag{5.3}$$

Because the relation between $p_{t-1}$ and $p_{t-2}$ is the same as that between $p_t$
and $p_{t-1}$, the solution to Equation 5.3 is

$$p_t - \frac{\nu}{\mu + \nu} = \left(p_0 - \frac{\nu}{\mu + \nu}\right)(1 - \mu - \nu)^t \tag{5.4}$$
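One way to gain confidence in the closed-form solution is to iterate the one-generation recursion and compare. The sketch below assumes a forward rate of $10^{-4}$ and a reverse rate of $10^{-5}$ (the values used for Figure 5.4); the function names are hypothetical.

```python
def next_freq(p_prev: float, mu: float, nu: float) -> float:
    """Equation 5.2: A alleles that escaped forward mutation plus
    a alleles that reverse-mutated to A."""
    return p_prev * (1.0 - mu) + (1.0 - p_prev) * nu


def closed_form(p0: float, mu: float, nu: float, t: int) -> float:
    """Equation 5.4 solved for p_t."""
    p_hat = nu / (mu + nu)
    return p_hat + (p0 - p_hat) * (1.0 - mu - nu) ** t


mu, nu, p = 1e-4, 1e-5, 1.0
for _ in range(500):
    p = next_freq(p, mu, nu)

# The iterated and closed-form values should agree to rounding error.
print(p, closed_form(1.0, mu, nu, 500))
```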


To understand what happens to the allele frequency in the long run,
consider Equation 5.4 in the case when t is very large, for example $10^5$ or $10^6$
generations. Even though $1 - \mu - \nu$ is ordinarily close to 1, the value of t
eventually becomes so large that $(1 - \mu - \nu)^t$ becomes approximately 0. Thus, the
whole right-hand term in Equation 5.4 goes to 0, and so $p_t$ eventually attains a
value that remains the same generation after generation. Such a value of p is
called an equilibrium value, which we will denote by $\hat{p}$. In the case of
reversible mutation, the equilibrium is found by equating the left-hand side of
Equation 5.4 to 0; hence

$$\hat{p} = \frac{\nu}{\mu + \nu} \tag{5.5}$$


The manner in which $p_t$ converges to its equilibrium value is shown in
Figure 5.4 for the case $\mu = 10^{-4}$ and $\nu = 10^{-5}$. Note that, whatever the initial
frequency of A, the allele frequency of A eventually goes to $\hat{p}$, which in this
example equals 0.00001/(0.0001 + 0.00001) = 0.091. Figure 5.4 also indicates
that mutation pressure is usually very weak in changing allele frequency,
inasmuch as the population requires thousands or tens of thousands of
generations to reach equilibrium.
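The slow approach to equilibrium can be quantified from the closed-form solution, in which the distance from equilibrium shrinks by a factor of one minus the sum of the two mutation rates in each generation. A minimal sketch under that assumption, with hypothetical function names and the rates of Figure 5.4 ($10^{-4}$ forward, $10^{-5}$ reverse):

```python
import math


def equilibrium_freq(mu: float, nu: float) -> float:
    """Equilibrium frequency of A under reversible mutation: nu / (mu + nu)."""
    return nu / (mu + nu)


def generations_to_close_gap(mu: float, nu: float,
                             remaining_fraction: float) -> float:
    """Generations until the distance from equilibrium has shrunk to the
    given fraction of its starting value."""
    return math.log(remaining_fraction) / math.log1p(-(mu + nu))


mu, nu = 1e-4, 1e-5
print(round(equilibrium_freq(mu, nu), 3))
print(round(generations_to_close_gap(mu, nu, 0.5)))   # halve the gap
print(round(generations_to_close_gap(mu, nu, 0.01)))  # within 1% of the gap
```

Halving the gap takes several thousand generations, and closing 99 percent of it takes tens of thousands, in line with the time scale stated above.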


Figure 5.4 Theoretical change in allele frequency under pressure of reversible
mutation. The attainment of near-equilibrium values requires tens of thousands
of generations for realistic mutation rates. In this example, the forward
mutation rate ($A \to a$) is $\mu = 10^{-4}$ and the reverse mutation rate ($a \to A$) is
$\nu = 10^{-5}$. The equilibrium allele frequency of A, calculated from Equation 5.5,
is 0.091.

PROBLEM 5.2 The bacterium Salmonella typhimurium has a genetic
switching mechanism that regulates the production of alternative
forms of a protein component of the cellular flagella. There are two
alleles, which we will call A (for the "specific-phase" flagellar
property) and a (for the "group-phase" flagellar property). Switching back
and forth between A and a takes place rapidly enough that Equation 5.4
can be applied. The transition from A to a has a rate of $\mu = 8.6 \times 10^{-4}$
per generation, and that of a to A has a rate of $\nu = 4.7 \times 10^{-3}$ per
generation. These rates are orders of magnitude larger than mutation rates
typically observed for other genes. The reason is that the change from
A to a and back again does not result from mutation in the
conventional sense but from intrachromosomal recombination (Simon et al.
1980). Formally, however, we can treat the system as one with
reversible mutation. In cultures initially established with the
frequency of A at $p_0 = 0$, Stocker (1949) found that the frequency increased to
p = 0.16 after 30 generations and to p = 0.85 after 700 generations. In
cultures initiated with $p_0 = 1$, the frequency decreased to 0.88 after 388
generations and to 0.86 after 700 generations. How do these values
agree with those calculated from Equation 5.4 using the estimated
mutation rates? What is the predicted equilibrium frequency of A?


ANSWER Note that $\nu/(\mu + \nu) = 0.845$. This is the predicted
equilibrium frequency (Equation 5.5). Also, $1 - \mu - \nu = 0.99444$, and this
quantity determines the rate of approach to equilibrium. For the cultures
with $p_0 = 0$, the predicted values are $p_{30} = 0.845 - (0.845)(0.99444)^{30} =
0.13$ and $p_{700} = 0.845 - (0.845)(0.99444)^{700} = 0.83$. For the cultures with
$p_0 = 1$, the predicted values are $p_{388} = 0.845 + (0.155)(0.99444)^{388} = 0.86$
and $p_{700} = 0.845 + (0.155)(0.99444)^{700} = 0.85$. The predicted values are
in very good agreement with the observations.
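The predictions in this answer can be recomputed from the closed-form solution of the reversible-mutation model, using the switching rates given in the problem ($8.6 \times 10^{-4}$ from A to a and $4.7 \times 10^{-3}$ from a to A). This is a minimal sketch; the function name is hypothetical.

```python
def predicted_freq(p0: float, mu: float, nu: float, t: int) -> float:
    """Frequency of A after t generations of reversible mutation,
    from the closed-form solution (Equation 5.4)."""
    p_hat = nu / (mu + nu)
    return p_hat + (p0 - p_hat) * (1.0 - mu - nu) ** t


mu, nu = 8.6e-4, 4.7e-3  # A -> a and a -> A switching rates from the problem

print(round(nu / (mu + nu), 3))             # predicted equilibrium
for p0, t in ((0.0, 30), (0.0, 700), (1.0, 388), (1.0, 700)):
    print(p0, t, round(predicted_freq(p0, mu, nu, t), 2))
```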


Probability  of  Fixation  of  a New  Neutral  Mutation 

The assumption of an infinite population size is not very realistic. In an
improved model in which the population is finite, the change in frequency of
a mutant allele depends not only on the mutation pressure but also on
random sampling from generation to generation. The sampling process, called
random genetic drift, results in chance changes in allele frequency. The process
is illustrated in Figure 5.5. The squares represent the 2N alleles in the adult
population in generation t. Each allele is assigned a unique label
($\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_{2N}$) to temporarily mask its identity as either A or a.
The circles represent the essentially infinite pool of gametes in generation t.
In the gamete pool, each labeled allele has a frequency of 1/2N. The squares at
the bottom represent two diploid genotypes in generation t + 1 formed by random
sampling from the pool of gametes. By chance, the two alleles forming a
genotype may be replicas of the same allele in the previous generation, for
example, $\alpha_i\alpha_i$. Alternatively, the two alleles forming a genotype may come
from different alleles in the previous generation, for example, $\alpha_i\alpha_j$.

Figure 5.5 Random sampling of alleles in a finite population increases the
probability of identity by descent (IBD). Two randomly chosen alleles,
illustrated in the squares at the bottom, may be IBD either because they are
replicas of the same allele in the immediately preceding generation ($\alpha_i\alpha_i$) or
because they are replicas of the same allele in a more remote generation
($\alpha_i\alpha_j$).

The random sampling from the gamete pool means that some alleles may
be overrepresented in generation t + 1, relative to their frequency in
generation t, and some alleles may be underrepresented. Indeed, any particular
allele has a good chance of being unrepresented in generation t + 1, and
hence the lineage of that allele is terminated. To be precise, each allele in
generation t has a chance of approximately 1/e = 0.368 of not being represented
in generation t + 1. To understand why, consider the allele designated $\alpha_1$. The
frequency of $\alpha_1$ in the gamete pool is 1/2N, and the frequency of all other
alleles together is therefore 1 - 1/2N. Because the genotypes in generation t + 1
are formed by the random selection of 2N alleles from the pool of gametes,
the distribution of the number of $\alpha_1$ and non-$\alpha_1$ alleles present in generation
t + 1 is given by successive terms in the binomial expansion (Chapter 1):

$$\left[\frac{1}{2N}\alpha_1 + \left(1 - \frac{1}{2N}\right)\bar{\alpha}\right]^{2N} \tag{5.6}$$

in which $\bar{\alpha}$ represents the collection of all alleles other than $\alpha_1$. Hence, the
probability that $\alpha_1$ is not represented in generation t + 1 is

$$\left(1 - \frac{1}{2N}\right)^{2N} \approx 1/e = 0.368 \tag{5.7}$$


The  approximation  is  very  good  even  when  N is  quite  small.  For  example, 
when  N = 10,  the  left-hand  side  of  Equation  5.7  equals  0.358,  and,  when 
N = 20,  the  left-hand  side  equals  0.363. 
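The quality of the approximation in Equation 5.7 is easy to tabulate. In this minimal sketch the function name is hypothetical; the quantity computed is the probability that none of the 2N sampled alleles is a copy of one particular parental allele.

```python
import math


def prob_unrepresented(n_diploid: int) -> float:
    """Probability that a given allele leaves no copies in the next
    generation of a diploid population of size N: (1 - 1/2N)^(2N)."""
    two_n = 2 * n_diploid
    return (1.0 - 1.0 / two_n) ** two_n


for n in (10, 20, 100, 1000):
    print(n, round(prob_unrepresented(n), 4))
print("1/e =", round(1 / math.e, 4))
```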

The important implication of Equation 5.7 is that, owing to random
genetic drift, the ancestral lineage of each allele faces a substantial risk of
extinction in each generation. As time goes on, the lineages progressively
disappear, one or a few at a time. Eventually, a time is reached at which all
lineages except one have become extinct. At that time, every allele in the
population is identical by descent with a particular allele present in an
ancestral population.

The ultimate extinction of all but one lineage implies the answer to the
question: What is the probability that a single new mutation eventually
becomes fixed in a population of size 2N? The reasoning is illustrated in
Figure 5.6. Parts A and B show all the alleles present in the current
generation, immediately after a new mutation (shaded circle) has been created.
After a sufficient number of generations have passed, each of the alleles in
the descendant population will descend from a single allele, chosen at
random, in the current population. In part A, the descendant alleles all derive
from one of the nonmutants in the current population; the nonmutant alleles
have frequency 1 - 1/2N, and so this is the probability of ultimate fixation of
a nonmutant. In part B, the descendant alleles all derive from the mutant, and
so 1/2N is the probability of ultimate fixation of a new mutant allele. More
generally, for neutral alleles, which do not affect the survival or
reproduction of the organism, the probability of ultimate fixation of a selectively
neutral allele in a finite population is equal to the frequency of the neutral
allele in the initial population.

Figure 5.6 In a finite population, the lineages of all alleles must trace back to a
single allele in some ancestral population. Here, a particular allele of interest in
a diploid population of size N is indicated by the shaded circle. (A) The
probability the designated allele is not destined to be the common ancestor of all
alleles many generations in the future is 1 - 1/2N. (B) The probability the
designated allele is destined to be the common ancestor of all alleles many
generations in the future is 1/2N. Hence, the probability of ultimate fixation of a
newly arising neutral allele is 1/2N.

For  the  lucky  few  neutral  alleles  that  are  eventually  fixed,  the  process 
takes  a long  time:  on  the  average,  4N  generations.  The  method  by  which  this 
result  can  be  deduced  is  considered  in  Chapter  7. 
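The claim that a neutral allele is fixed with probability equal to its initial frequency can be checked by simulation. The sketch below assumes the sampling scheme described for random genetic drift (each generation formed by drawing 2N alleles at random from the gamete pool); the function name, the small population size, the replicate count, and the random seed are all illustrative choices.

```python
import random


def fixation_fraction(n_copies: int, start_count: int,
                      replicates: int, seed: int = 1) -> float:
    """Fraction of replicate populations in which a neutral allele that
    starts with start_count copies out of n_copies (= 2N) is fixed."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = start_count
        # Resample 2N alleles each generation until loss (0) or fixation (2N).
        while 0 < count < n_copies:
            freq = count / n_copies
            count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        if count == n_copies:
            fixed += 1
    return fixed / replicates


# A single new mutant among 2N = 10 alleles: expected fixation near 1/10.
print(fixation_fraction(10, 1, 5000))
# An allele at initial frequency 0.3: expected fixation near 0.3.
print(fixation_fraction(10, 3, 5000))
```

The simulated fractions fluctuate around the expected values because the number of replicates is finite; more replicates narrow the fluctuation.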




The  Infinite-Alleles  Model 

Recall from Chapter 2 that many genes have more than two alleles
represented among the organisms in a natural population. It is therefore of some
importance to determine the expected level of genetic variation under
mutation pressure. A convenient measure of genetic variation is the
heterozygosity (the proportion of heterozygous genotypes). If a gene has a
greater heterozygosity than expected from mutation pressure alone, then
other forces that operate in nature must tend to preserve genetic variation.
On the other hand, if a gene has a smaller heterozygosity than expected, then
other forces must tend to eliminate genetic variation.

The heterozygosity of a gene is a function of the number of alleles and
their relative frequencies. In principle, the number of alleles of any gene
could be very large. For example, a gene coding for a protein of 300 amino
acids has a coding sequence 900 nucleotides in length. Because each
nucleotide site could be occupied by either an A, T, G, or C, the total number of
possible alleles is $4^{900}$, which equals about $10^{542}$. Hence, we can suppose that
every new mutation creates an allele that does not already exist in the
population. This is called the infinite-alleles model of mutation. The
infinite-alleles model is but one way to specify the characteristics of new
mutations. Although it represents a somewhat simplified view of mutation, it
nevertheless provides a useful standard of comparison for other models or for
observed allele frequencies.
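The size of the allele space for a 900-nucleotide coding sequence can be confirmed directly, since Python integers have arbitrary precision. A minimal sketch with a hypothetical function name:

```python
import math


def log10_possible_alleles(n_sites: int, n_bases: int = 4) -> float:
    """Base-10 logarithm of the number of distinct sequences of n_sites
    nucleotides, each site being one of n_bases bases."""
    return n_sites * math.log10(n_bases)


print(round(log10_possible_alleles(900), 2))  # exponent of 10
print(len(str(4 ** 900)))                     # number of decimal digits
```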

In the infinite-alleles model, two alleles that are identical by state must
also be identical by descent because of the assumption that each mutation
creates a unique allele. Hence, in this model, homozygous genotypes must be
autozygous. To measure the homozygosity, therefore, we need to calculate
the autozygosity. This can be done with reference to the finite-population
model in Figure 5.5. As in Chapter 4, we let $F_t$ be the probability that, in
generation t, two alleles randomly chosen from a population are identical by
descent. In the context of Figure 5.5, the randomly chosen alleles are
combined in pairs to make genotypes, and so $F_t$ is also the probability of
autozygosity in generation t. We will use the $\alpha_i\alpha_i$ and $\alpha_i\alpha_j$ genotypes in
generation t in Figure 5.5 to derive an expression for $F_t$ in terms of $F_{t-1}$, N,
and the mutation rate $\mu$. First, consider the genotype $\alpha_i\alpha_i$. What is the
probability that this genotype has alleles that are identical by descent? The
alleles must be identical by descent provided that neither allele has mutated in
the course of one generation, and so the probability of identity by descent in
this case is $(1 - \mu)^2$. Now consider the genotype $\alpha_i\alpha_j$. These alleles are
identical by descent only if two randomly chosen alleles in generation t - 1 are
identical by descent, and if neither allele mutated in the course of one
generation, and so the probability of identity by descent in this case is
$F_{t-1}(1 - \mu)^2$. Because each of the labeled $\alpha$'s in Figure 5.5 has the same
frequency in the gamete pool (namely, 1/2N), the probability of a combination
like $\alpha_i\alpha_i$ is 1/2N and the probability of a combination like $\alpha_i\alpha_j$ is
1 - 1/2N. Putting all this together, the recurrence equation for $F_t$ is

$$F_t = \frac{1}{2N}(1 - \mu)^2 + \left(1 - \frac{1}{2N}\right)(1 - \mu)^2 F_{t-1} \tag{5.8}$$

Eventually an equilibrium value of $F$, call it $\hat{F}$, is attained in
which the increase in autozygosity from random genetic drift in any
generation is exactly offset by the decrease in autozygosity from new
mutations. The equilibrium can be found by equating
$F_t = F_{t-1} = \hat{F}$ in Equation 5.8 and solving. Ignoring terms in
$\mu^2$ and those in $\mu/N$ because they are expected to be negligibly
small, the solution is

$$\hat{F} = \frac{1}{1 + 4N\mu} \tag{5.9}$$

to an excellent approximation. Therefore, the number of selectively neutral
alleles increases under mutation pressure until $\hat{F}$ satisfies
Equation 5.9. Being the equilibrium value of the probability of identity by
descent, $\hat{F}$ is also the equilibrium value of the autozygosity.
Because of the assumption in the infinite-alleles model that each allele in
the population arises only once, all genotypes that are homozygotes must
also be autozygous. Therefore, $\hat{F}$ can also be interpreted as the
equilibrium value of the proportion of homozygous genotypes.
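
The algebra behind this approximation can be sketched as follows. The starting point is the recurrence implied by the two cases described above, $F_t = (1-\mu)^2[1/(2N) + (1 - 1/(2N))F_{t-1}]$, with $F_t = F_{t-1} = \hat{F}$ at equilibrium. This is a reconstruction under that assumption rather than a quotation of the original derivation.

```latex
\begin{align*}
\hat{F} &= (1-\mu)^2\left[\frac{1}{2N} + \left(1-\frac{1}{2N}\right)\hat{F}\right] \\
\hat{F}\left[1 - (1-\mu)^2\left(1-\frac{1}{2N}\right)\right] &= \frac{(1-\mu)^2}{2N} \\
% ignore mu^2, so that (1-\mu)^2 \approx 1 - 2\mu
\hat{F}\left[\frac{1}{2N} + 2\mu - \frac{\mu}{N}\right] &\approx \frac{1}{2N} - \frac{\mu}{N} \\
% ignore the mu/N terms on both sides
\hat{F}\left[\frac{1}{2N} + 2\mu\right] &\approx \frac{1}{2N} \\
\hat{F} &\approx \frac{1}{1 + 4N\mu}
\end{align*}
```

The last line follows on multiplying both sides by 2N.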

It is an odd feature of Equation 5.9 that it gives the equilibrium
homozygosity of a population without explicit reference to allele
frequencies. The natural way to write the homozygosity expected with random
mating for n alleles with frequencies $p_1, p_2, p_3, \ldots, p_n$ is

$$\sum_{i=1}^{n} p_i^2 = p_1^2 + p_2^2 + \cdots + p_n^2 \tag{5.10}$$

We thus have two expressions for the equilibrium homozygosity in the forms
of Equations 5.9 and 5.10. Because the two equations refer to the same
thing, they must equal each other, and so
$\sum p_i^2 = \hat{F} = 1/(4N\mu + 1)$. Alternative approaches leading to
essentially the same result are discussed in Sved and Latter (1977).

The homozygosity is the proportion of homozygous genotypes in a population;
the heterozygosity is the proportion of heterozygous genotypes. Hence,
homozygosity and heterozygosity are opposite sides of the same coin.
Therefore, if the homozygosity in a population is given by
$\hat{F} = 1/(4N\mu + 1)$, then the heterozygosity is given by
$1 - \hat{F} = 4N\mu/(4N\mu + 1)$. These functions for the equilibrium
homozygosity and heterozygosity are plotted against $4N\mu$ in Figure 5.7.
The illustration shows that there is a rather narrow range of $4N\mu$ over
which an intermediate level of genetic variation (heterozygosity) is
maintained. For example, the equilibrium heterozygosity is in the range 0.2
to 0.8 only when $4N\mu$ is in the range 0.25 to 4.

Figure 5.7 Plot of average homozygosity and average heterozygosity for the
infinite-alleles model. Intermediate values of heterozygosity are maintained
over only a small range of $4N\mu$.
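
The approach to equilibrium, and the narrow range of 4Nμ that gives intermediate heterozygosity, can be illustrated numerically. The sketch below iterates the recurrence $F_t = (1-\mu)^2[1/(2N) + (1 - 1/(2N))F_{t-1}]$ and compares the result with the approximation 1/(1 + 4Nμ). The population size, mutation rate, and function names are hypothetical choices made only for illustration.

```python
def next_f(f_prev, n, mu):
    """One generation of F_t = (1-mu)^2 * [1/(2N) + (1 - 1/(2N)) * F_{t-1}]."""
    return (1 - mu) ** 2 * (1 / (2 * n) + (1 - 1 / (2 * n)) * f_prev)

def equilibrium_f(n, mu):
    """Approximate equilibrium homozygosity, 1 / (1 + 4*N*mu)."""
    return 1 / (1 + 4 * n * mu)

def equilibrium_h(n, mu):
    """Approximate equilibrium heterozygosity, 4*N*mu / (1 + 4*N*mu)."""
    return 1 - equilibrium_f(n, mu)

N, MU = 1000, 1e-4          # hypothetical values, giving 4*N*mu = 0.4
f = 0.0                     # start with no identity by descent
for _ in range(20000):      # long enough for the recurrence to settle
    f = next_f(f, N, MU)

print(f"iterated F = {f:.4f}; approximation = {equilibrium_f(N, MU):.4f}")

# Heterozygosity at selected values of 4*N*mu (N held fixed, mu varied).
for four_n_mu in (0.25, 1.0, 4.0):
    mu = four_n_mu / (4 * N)
    print(f"4N*mu = {four_n_mu}: H = {equilibrium_h(N, mu):.2f}")
```

With these values the iterated and approximate homozygosities agree to within about 0.0001, and the heterozygosity moves from 0.2 to 0.8 as 4Nμ increases from 0.25 to 4.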

A complication in the interpretation of Equation 5.10 is that any number of
distributions of allele frequency can result in the same homozygosity. For
example, a population in HWE with four alleles at frequencies $p_1 = 0.7$,
$p_2 = 0.1$, $p_3 = 0.1$, and $p_4 = 0.1$ has a homozygosity of
$\sum p_i^2 = 0.52$; likewise, a population in HWE with two alleles at
frequencies $p_1 = 0.6$ and $p_2 = 0.4$ also has a homozygosity of 0.52.
The problem that many distributions of allele frequency can result in the
same homozygosity can be sidestepped by assuming that all alleles are
equally frequent. If the population contains n equally frequent alleles,
then $p_1 = p_2 = p_3 = \cdots = p_n = 1/n$; the homozygosity is calculated
from Equation 5.10 as $\sum p_i^2 = n(1/n)^2 = 1/n$. At equilibrium,
therefore, $1/n = \hat{F} = 1/(4N\mu + 1)$, or $n = 4N\mu + 1$. The number
n of equally frequent alleles is called the effective number of alleles,
often symbolized as $n_e$. Diverse distributions of allele frequency can be
compared in terms of their effective number of alleles. Biologically
speaking, $n_e$ is the number of equally frequent alleles that would be
required to produce the same homozygosity as observed in an actual
population. In the examples given at the beginning of this paragraph, the
four-allele population and the two-allele population with identical
homozygosities of 0.52 also have the same effective number of alleles,
namely $n_e = 1/0.52 = 1.92$.
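
The calculation of the effective number of alleles is simple enough to express in a few lines. The following sketch, with illustrative function names, computes the homozygosity as the sum of squared allele frequencies and takes its reciprocal, using the two example populations from the paragraph above.

```python
def homozygosity(freqs):
    """Expected homozygosity under random mating: sum of squared allele frequencies."""
    return sum(p * p for p in freqs)

def effective_number_of_alleles(freqs):
    """Number of equally frequent alleles giving the same homozygosity."""
    return 1 / homozygosity(freqs)

four_alleles = [0.7, 0.1, 0.1, 0.1]
two_alleles = [0.6, 0.4]

for freqs in (four_alleles, two_alleles):
    print(
        f"{len(freqs)} alleles: homozygosity = {homozygosity(freqs):.2f}, "
        f"n_e = {effective_number_of_alleles(freqs):.2f}"
    )
```

Both populations give a homozygosity of 0.52 and hence the same effective number, even though the actual numbers of alleles differ.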
PROBLEM 5.3 An allozyme study of a Caribbean population of
Drosophila willistoni (Ayala and Tracy 1974) yielded the following
estimated allele frequencies for the loci Adk-1 (adenylate kinase-1), Lap-5
(leucine amino peptidase-5), and Xdh (xanthine dehydrogenase).

            Adk-1    Lap-5    Xdh
Allele 1    0.574    0.801    0.446
Allele 2    0.309    0.177    0.406
Allele 3    0.114    0.014    0.092
Allele 4    0.003    0.004    0.034
Allele 5      -      0.004    0.014
Allele 6      -        -      0.004
Allele 7      -        -      0.002
Allele 8      -        -      0.002

Estimate the effective number of alleles of each gene.


ANSWER The effective number of alleles is estimated as the reciprocal
of $\sum p_i^2$. For Adk-1, $n_e$ = 2.28; for Lap-5, $n_e$ = 1.49; and for
Xdh, $n_e$ = 2.68. Note that the effective number of alleles is determined
more by the uniformity of allele frequencies than by the actual number of
alleles. For example, Lap-5 has more actual alleles than Adk-1 but a
smaller effective number of alleles.
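
The answer can be checked directly from the tabulated frequencies. The short sketch below stores the three columns of the table in a dictionary (the layout and variable names are illustrative) and takes the reciprocal of the sum of squared frequencies for each locus.

```python
# Allele frequencies from the table in Problem 5.3; absent alleles are omitted.
allele_freqs = {
    "Adk-1": [0.574, 0.309, 0.114, 0.003],
    "Lap-5": [0.801, 0.177, 0.014, 0.004, 0.004],
    "Xdh": [0.446, 0.406, 0.092, 0.034, 0.014, 0.004, 0.002, 0.002],
}

effective = {}
for locus, freqs in allele_freqs.items():
    sum_sq = sum(p * p for p in freqs)   # expected homozygosity
    effective[locus] = 1 / sum_sq        # effective number of alleles
    print(f"{locus}: {len(freqs)} alleles, n_e = {effective[locus]:.2f}")
```

The output makes the closing point of the answer concrete: Lap-5 has five alleles to Adk-1's four, yet its effective number is smaller because one allele is so common.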


Neutral  Mutations 

The  hypothesis  that  many  genetic  polymorphisms  result  from  selectively 
neutral  alleles  maintained  by  a balance  between  the  effects  of  mutation  and 
random  genetic  drift  is  known  as  the  neutral  theory  or  the  theory  of  selec- 
tive neutrality  (Kimura  1968;  King  and  Jukes  1969).  Mutation  introduces 
new  alleles  into  a population,  and  random  genetic  drift  determines  whether 
a neutral  allele  will  ultimately  be  fixed  or  lost.  (Loss  is  the  usual  outcome.) 
At  equilibrium,  there  is  a balance  between  mutation  and  random  genetic 
drift,  so  that,  on  the  average,  each  new  allele  gained  by  mutation  is  balanced 
against  an  existing  allele  that  is  lost  (or,  more  rarely,  fixed).  The  balance  point 
for  the  homozygosity  in  the  infinite-alleles  model  is  given  in  Equation  5.9. 

In  essence,  the  neutrality  hypothesis  states  that  many  mutations  have  so 
little effect on the organism that their influence on survival and
reproduction is negligible. The frequencies of neutral alleles are not,
therefore, determined by natural selection. Consequently, if the neutrality
hypothesis is
true,  then  many  polymorphisms  may  have  no  particular  significance  in  the 
adaptation  of  a species  to  its  environment.  From  the  perspective  of  adapta- 
tion, selectively  neutral  polymorphisms  are  mere  evolutionary  "noise"  and, 
regardless  of  how  much  their  study  may  reveal  about  population  structure 
and  random  genetic  drift,  they  tell  us  little  or  nothing  about  adaptive  genetic 
changes  in  evolution.  Kimura  (1968)  gave  the  irony  a positive  spin  by  noting 
that  "if  my  chief  conclusion  [about  the  prevalence  of  neutral  alleles]  is  correct, 
then  we  must  recognize  the  great  importance  of  random  genetic  drift ...  in 
forming  the  genetic  structure  of  biological  populations."  Quite  so.  Indeed, 
while  neutral  alleles  are  unsuitable  for  the  study  of  genetic  adaptation,  the 
very  fact  that  they  are  invisible  to  natural  selection  makes  them  ideal  for 
mapping  the  geographical  structure  of  populations  and  for  tracing  the  ances- 
tral lineages  of  DNA  sequences  to  make  inferences  about  the  phylogenetic 
relationships  between  species. 

Because  the  neutrality  hypothesis  is  of  fundamental  importance  in  popu- 
lation genetics  and  evolution,  it  has  been  a subject  of  considerable  discussion. 
The  neutrality  hypothesis  was  put  forward  in  the  late  1960s  at  a time  when 
most  of  the  genome  was  supposed  to  have  a protein-coding  function.  Introns 
and  other  noncoding  sequences  were  unknown.  Today  it  is  clear  that  only 
about  4 percent  of  the  mammalian  genome  codes  for  proteins.  The  low  cod- 
ing density  affords  ample  scope  for  mutations  that  have  little  or  no  effect  on 
fitness,  including  some  (but  by  no  means  all)  mutations  in  introns,  pseudo- 
genes, spacers  between  genes,  noncoding  DNA  in  the  centromeric  region  of 
chromosomes,  and  so  forth. 

There  is  still  considerable  controversy  whether  amino  acid  polymor- 
phisms are  selectively  neutral  or  nearly  neutral.  To  assess  the  plausibility  of 
the  neutrality  hypothesis,  many  aspects  of  the  model  must  be  compared  with 
the  situation  in  actual  populations.  One  aspect  of  the  hypothesis  developed  in 
the  preceding  section  concerns  the  homozygosity  to  be  expected  with  the 
infinite-alleles  model.  Using  an  observed  allozyme  homozygosity,  we 
can estimate the effective number of alleles $n_e$ and, from the expression
$n_e = 4N\mu + 1$, estimate the corresponding value of $N\mu$. If the resulting values
are  grossly  unreasonable,  we  can  safely  reject  the  infinite-alleles  version  of 
the  neutrality  hypothesis  (or  at  least  argue  that  actual  populations  cannot  be 
in  equilibrium). 

Recall from Chapter 2 that observed values of heterozygosity of allozyme
genes range from 0.04 to 0.14 in most organisms (see Figure 2.9). Observed
homozygosities therefore range from 1 - 0.04 = 0.96 to 1 - 0.14 = 0.86,
which corresponds to estimated $n_e$ in the range 1/0.96 = 1.04 to
1/0.86 = 1.16. Estimates of $N\mu$, calculated as $(n_e - 1)/4$, therefore
range from 0.01 to 0.04. The fact that the maximum estimated value of
$N\mu$ differs from the minimum by a factor of only about four is
surprising, inasmuch as the population number in different species ranges
over a factor of $10^4$ or more. The apparently too uniform distribution of
allozyme homozygosities among diverse organisms has been interpreted as
implying that the neutrality hypothesis is wrong for amino acid
polymorphisms. On the other hand, estimates of the population number in
natural populations are generally imprecise because the studies are very
difficult, and estimates of $\mu$, which in this case is the mutation rate
to neutral alleles, are even more uncertain.
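
The arithmetic in this argument can be laid out explicitly. The sketch below assumes the infinite-alleles equilibrium, so that the effective number of alleles equals 4Nμ + 1, and converts an observed heterozygosity into an estimate of Nμ. The function name is illustrative.

```python
def n_mu_from_heterozygosity(h):
    """Estimate N*mu from heterozygosity h, assuming n_e = 4*N*mu + 1."""
    n_e = 1 / (1 - h)        # effective number of alleles = 1 / homozygosity
    return (n_e - 1) / 4

low = n_mu_from_heterozygosity(0.04)    # lower end of the observed range
high = n_mu_from_heterozygosity(0.14)   # upper end of the observed range

print(f"N*mu ranges from {low:.3f} to {high:.3f}")
print(f"ratio of maximum to minimum = {high / low:.1f}")
```

The ratio comes out close to four, which is the basis for the remark that the estimates vary far less than population sizes do.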

Figure  5.8A  shows  a second  type  of  test  of  the  adequacy  of  the  neutrality 
hypothesis  in  explaining  observed  levels  of  genetic  variation  of  allozyme 
genes.  The  shaded  histogram  is  the  observed  distribution  of  heterozygosity 
of  74  genes  in  Caucasians.  The  histogram  outlined  in  solid  lines  is  a comput- 
er-generated theoretical  distribution  expected  with  the  infinite-alleles  model. 
The observed average heterozygosity is 0.099, and the theoretical
heterozygosity is 0.091. The correspondence between the histograms is
fairly good, but the observed distribution seems to include too many genes
with heterozygosities in the range of 0.35 to 0.55. (For a possible
explanation, see Fuerst et al. 1977.)

Figure 5.8 (A) Observed distribution of allozyme heterozygosity among
genes in Caucasians (shaded) along with theoretical distribution for
selective neutrality (solid lines). (B) Mean and variance of heterozygosity
among allozyme genes in vertebrates: mammals (33 species), birds (2
species, 1 subspecies), fish (18 species, 1 subspecies), lizards (21
species), and amphibians (3 species, 1 subspecies). The solid line is the
theoretical curve for the infinite-alleles model when the mutation rate to
neutral alleles varies among genes in such a manner that the variance in
mutation rate equals the square of the mean mutation rate. (After Nei et
al. 1976.)

A third  type  of  test  of  the  neutrality  hypothesis  is  shown  in  Figure  5.8B, 
which  presents  data  on  the  mean  and  variance  of  heterozygosity  in  77  verte- 
brate species.  The  curve  is  the  theoretical  expectation  from  the  infinite-alleles 
model  when  the  rate  of  selectively  neutral  mutation  varies  among  genes  (Nei 
et  al.  1976).  At  first  glance,  the  fit  in  Figure  5.8B  is  impressive.  On  the  other 
hand,  the  observed  points  are  sufficiently  scattered  that  any  number  of  other 
curves  might  fit  at  least  as  well.  Evidently,  statistical  comparisons  of  this  sort 
are  too  lacking  in  power  to  distinguish  between  the  hypotheses. 

A brief  consideration  of  the  phrase  lacking  in  power  may  be  in  order.  The 
neutral  theory  is  useful  in  being  a sort  of  starting  point,  or  null  hypothesis, 
which  provides  predictions  about  the  relationships  among  observed  quanti- 
ties that  can  be  confirmed  or  rejected.  Statistical  tests  of  the  neutral  theory  are 
similar  to  other  types  of  statistical  tests  in  that  two  distinct  types  of  possible 
errors  must  be  balanced.  If  the  tests  are  too  demanding  (for  example,  in  fail- 
ing to  allow  for  the  effects  of  random  sampling  error),  then  data  may  often 
result  in  rejection  of  the  hypothesis  even  when  it  is  true.  False  rejection  is 
called  Type  I error.  On  the  other  hand,  if  the  statistical  test  allows  too  much 
latitude in the data, then data will seldom result in rejection of the
hypothesis even when it is false. False acceptance is called Type II error.
The tradeoff between Type I error and Type II error is that, for a given
test and a fixed amount of data, the probability of Type I error cannot be
decreased without increasing the probability of Type II error, and
vice versa. By convention, statisticians usually adopt a 5 percent criterion for
rejection  of  the  null  hypothesis  even  when  it  is  true.  This  is  the  familiar  5% 
level  of  statistical  significance,  and  it  means  that  there  is  a 5%  chance  of 
rejecting  a true  hypothesis  (Type  I error).  With  this  convention,  the  probabil- 
ity of  a Type  II  error  (failing  to  reject  a false  hypothesis)  falls  where  it  may, 
and  a test  with  a relatively  high  probability  of  Type  II  error  is  said  to  be  lack- 
ing in  power. 
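
The tradeoff between the two kinds of error can be made concrete with a deliberately simple case that is not one of the tests discussed in this chapter: a one-sided test on a normally distributed statistic. The sketch below assumes a hypothetical true effect of one standard error and uses only the Python standard library; all names and values are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def type_ii_error(alpha, effect):
    """Probability of failing to reject a false null in a one-sided z-test.

    `effect` is the true shift of the test statistic in standard-error units.
    """
    critical = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold at level alpha
    return STD_NORMAL.cdf(critical - effect)   # chance the statistic falls below it

EFFECT = 1.0  # hypothetical true effect size
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, EFFECT)
    print(f"alpha = {alpha:.2f}: beta = {beta:.2f}, power = {1 - beta:.2f}")
```

As the Type I error rate is tightened from 10% to 1%, the Type II error rate rises, and at the conventional 5% level a test of this kind would miss an effect of this size roughly three times out of four.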

Although  the  comparisons  in  Figure  5.8  are  lacking  in  power  and  hence 
are  inconclusive  in  their  support  of  the  neutrality  hypothesis,  many  other 
observations  and  types  of  data  have  been  brought  to  bear  in  assessing  the 
hypothesis.  These  data  often  rely  on  comparison  of  nucleotide  sequences  of 
DNA  in  different  genes  or  in  different  species.  These  types  of  comparisons 
and  the  conclusions  from  them  are  discussed  further  in  Chapter  7. 


LINKAGE  AND  RECOMBINATION 

In  the  context  of  genetic  variation,  the  importance  of  recombination  is  that  it 
allows  linked  alleles  to  become  associated  in  many  different  combinations. 
In a random mating diploid population, as discussed in Chapter 3, linked
alleles come into random association (linkage equilibrium) at a rate
determined by the frequency of recombination r (Equation 3.8). If r is
small, it may require many generations for linkage equilibrium to be
attained. For example, the average rate of recombination between adjacent
nucleotides in Drosophila is $2.7 \times 10^{-8}$, with wide variation in
different parts of the genome, and so nucleotide polymorphisms in the same
region of the genome are often in linkage disequilibrium. Consequently, the
ultimate fate of a new mutation may depend to a considerable extent on the
effects of other polymorphisms with which it is very closely linked. The
effect of recombination on the fate of genetic variation is the subject of
this section.
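
To see how slowly closely linked sites approach linkage equilibrium, one can compute the number of generations needed for the disequilibrium to fall by half. The sketch below assumes the standard random-mating result that disequilibrium decays as $D_t = (1 - r)^t D_0$, and it takes the rate between adjacent nucleotides to be $2.7 \times 10^{-8}$; both are stated here as assumptions, and the other values of r are hypothetical.

```python
import math

def generations_to_decay(r, fraction):
    """Generations until D falls to `fraction` of D_0, assuming D_t = (1 - r)**t * D_0."""
    # log1p(-r) evaluates log(1 - r) accurately even when r is tiny.
    return math.log(fraction) / math.log1p(-r)

R_ADJACENT = 2.7e-8   # assumed rate between adjacent nucleotides

for r in (0.5, 0.01, 1000 * R_ADJACENT, R_ADJACENT):
    half_life = generations_to_decay(r, 0.5)
    print(f"r = {r:.1e}: D halves in about {half_life:.3g} generations")
```

Unlinked loci (r = 0.5) lose half their disequilibrium in a single generation, whereas adjacent nucleotides would need tens of millions of generations under these assumptions.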

Presumed  Evolutionary  Benefit  of  Recombination 

Evolutionary  biologists  have  long  taken  it  for  granted  that  recombination  is 
important  in  evolution  because  it  accelerates  the  rate  of  formation  of  benefi- 
cial gene  combinations.  A graphical  representation  of  the  process  is  illustrat- 
ed in  Figure  5.9.  In  part  A are  two  large  populations,  one  with  no  recombi- 
nation (an  asexual  species)  and  one  with  recombination  (a  sexual  species). 
Each  has  three  favorable  mutations,  a,  b,  and  c,  which  ultimately  become 
incorporated  into  the  genome.  In  the  asexual  species,  the  mutations  are 
incorporated  sequentially  because  each  favorable  mutation  must  take  place 
in  the  genetic  background  of  the  one  before.  The  process  is  slow  because 
each  favorable  mutation  must  be  nearly  fixed  before  there  is  a high  chance 
that the next favorable mutation takes place in the proper genetic
background. In contrast, in the sexual population, there is no such problem.
Recombination between the genes allows the triple mutant abc to be formed
almost immediately.

The  evolutionary  advantage  of  recombination  outlined  in  Figure  5.9A 
does  not  apply  as  strongly  to  the  small  populations  in  Figure  5.9B.  In  a small 
population,  three  favorable  mutations  are  unlikely  to  be  present  simultane- 
ously, and  so  the  fixation  of  the  favorable  alleles  proceeds  sequentially  in  a 
sexual  as  well  as  in  an  asexual  species. 

Recombination  and  Polymorphism 

Because  recombination  between  adjacent  nucleotides  is  infrequent,  nearby 
nucleotide  sites  tend  to  evolve  together.  Owing  to  genetic  linkage,  forces  that 
tend  to  maintain  genetic  diversity  or  that  tend  to  reduce  genetic  diversity 
will  act  regionally.  Therefore,  the  level  of  polymorphism  found  in  any  region 
of  the  genome  is  expected  to  be  correlated  with  the  level  of  polymorphism  in 
a closely  linked  region.  Evolutionary  forces  thus  leave  their  mark  on  the  level 
and  type  of  genetic  variation  found  within  closely  linked  regions  of  the 
genome. 

In D. melanogaster, an important pattern of genetic polymorphism associated
with degree of linkage is illustrated in Figure 5.10. A region of the
genome in which the rate of recombination per nucleotide is reduced, such as
near the tip or near the base of each chromosome arm, also tends to have a
reduced level of genetic polymorphism even though the rates of mutation are
uniform across the chromosome (Aquadro et al. 1994). In Figure 5.10, the
level of polymorphism is expressed as the proportion of nucleotide sites
that are polymorphic (called $\theta$ in Chapter 2). For the regions
plotted, $\theta$ ranges over more than a factor of 10, so there is clearly
an important effect of close linkage in reducing the level of polymorphism.

Figure 5.9 Evolutionary effect of recombination. (A) In a large population
of an asexual species with no recombination (top panel), the favorable
mutations a, b, and c must be incorporated into the genome sequentially
because there is no mechanism to bring the favorable mutations together;
each favored mutation must reach a high frequency to have a reasonable
chance that the next favorable mutation will take place in the proper
genetic background. With recombination (bottom panel), recombination
between the favorable genes enables the triple mutant abc to be formed very
rapidly. (B) The beneficial effect of recombination is diminished in a very
small population because, in a small population, multiple favorable
mutations are unlikely to be present simultaneously. Panels: (A) large
population, (B) small population; horizontal axis, time. (From Crow and
Kimura 1970.)

Figure 5.10 Observed relation between the level of nucleotide polymorphism
and the rate of recombination in Drosophila. (From Aquadro et al. 1994.)

In  theory,  the  reduction  in  the  level  of  polymorphism  in  regions  of  tight 
linkage  could  be  explained  by  either  of  two  diametrically  opposed  mecha- 
nisms. In  one  mechanism,  the  reduction  results  from  the  fixation  of  favor- 
able mutations.  In  the  other  mechanism,  the  reduction  results  from  the 
elimination  of  harmful  mutations.  These  explanations  have  somewhat  differ- 
ent implications  for  the  pattern  of  polymorphism  in  regions  of  tight  linkage, 
and  so  they  can  be  distinguished  experimentally. 

Consider  first  the  consequences  of  fixation  of  a favorable  mutation.  On  its 
way  to  fixation,  any  new  favorable  mutation  may  carry  along  a small  sur- 
rounding region  of  the  genome  and  render  the  region  monomorphic.  The 
monomorphism  will  not  usually  be  complete.  Some  degree  of  polymorphism 
may  remain  in  the  region,  either  because  new  mutations  happen  in  the 
process of fixation or because of rare recombination events that take place.
The process in which a favorable mutation becomes fixed in a population is
called  a selective  sweep.  During  a selective  sweep  of  a favorable  allele,  any 
neutral  alleles  sufficiently  tightly  linked  go  along  for  the  ride  and  are  said  to 
be  hitchhiking.  The  main  effect  of  hitchhiking  is  that  a small  region  around 
the  favored  allele  will  be  overrepresented  in  the  population.  In  other  words, 
there  will  be  an  apparent  excess  of  rare  genetic  variants  owing  to  the  over- 
representation of  the  region  that  profited  from  the  hitchhiking. 

Consider next the consequences of a harmful mutation. For concreteness,
consider the genetic map diagrammed in Figure 5.11A, in which the short
vertical lines indicate adjacent nucleotide sites. One site that can undergo
neutral mutation is embedded in the middle surrounded by sites that can
undergo harmful mutations only. The rate of harmful mutation per site per
generation is denoted $\mu$, and the rate of recombination between adjacent
sites is denoted r.

Figure 5.11 Effects of background selection on nucleotide polymorphism. (A)
A region of a chromosome containing a set of genes (tick marks) that can
mutate to detrimental alleles; within this set of genes is a single neutral
site. The mutation rate per locus is $\mu$ and the rate of recombination
between adjacent loci is r. (B) Relative nucleotide diversity as a function
of U, the total mutation rate, and R, the total recombination rate, across
the chromosomal region. Note the positive correlation between level of
nucleotide polymorphism and rate of recombination.

Suppose  further  that  each  mutation,  even  when  heterozygous,  is  suffi- 
ciently harmful  that  any  chromosome  in  which  a mutation  is  present  is  ulti- 
mately doomed.  In  the  absence  of  recombination,  the  fate  of  a chromosome 
depends  on  whether  it  is  free  of  harmful  mutations  because,  under  our 
assumptions,  no  chromosome  can  persist  for  long  unless  it  is  free  of  muta- 
tions. The  effect  of  harmful  mutation,  which  in  this  context  is  called  back- 
ground selection,  is  to  reduce  the  number  of  chromosomes  that  can 
contribute  to  the  ancestry  of  remote  generations.  Indeed,  the  effect  of  back- 
ground selection  is  identical  to  that  of  a reduction  in  population  size  except 
that  the  reduction  applies,  not  to  the  genome  as  a whole,  but  to  a tightly 
linked  region  (Charlesworth  et  al.  1993).  Background  selection  therefore 
reduces  the  level  of  genetic  polymorphism.  Looser  linkage  means  that  a 
linked  neutral  mutation  can  escape  the  fate  of  a harmful  neighboring  muta- 
tion by  recombination  with  a mutation-free  chromosome.  Hence,  the  tighter 
the  linkage,  the  greater  the  reduction  in  polymorphism  due  to  background 
selection.  Although  there  is  a reduction  in  the  level  of  polymorphism,  back- 
ground selection  does  not  skew  the  distribution  of  rare  polymorphisms 
because,  for  all  practical  purposes,  the  harmful  allele  merely  causes  one  chro- 
mosome to  drop  out  of  the  population,  much  as  if  it  were  to  go  extinct  by 
chance  (Braverman  et  al.  1995). 

Although  the  evidence  is  not  yet  conclusive,  the  model  of  background 
selection  appears  to  provide  a better  explanation  of  the  Drosophila  data  than 
does  the  model  of  selective  sweeps  (Hudson  and  Kaplan  1995;  Charlesworth 
et  al.  1995).  The  evidence  is  that  rare  nucleotide  polymorphisms  are  found  at 
a frequency  that  would  be  expected  given  the  overall  level  of  polymorphism 
(Braverman  et  al.  1995).  There  is  no  evidence  for  a skewed  distribution 
toward  rare  variants  that  the  model  of  selective  sweeps  would  predict. 

The effect of background selection on the level of genetic variation is
shown graphically in Figure 5.11B for the genetic map diagrammed in part A.
The curves are plotted from the formula

$$\pi = \pi_0 e^{-U/(2hs + R)} \tag{5.11}$$

(Hudson and Kaplan 1995). The symbol $\pi$ is the nucleotide diversity,
defined as the average proportion of nucleotide differences between all
possible pairs of sequences (Chapter 2); $\pi_0$ is the value of $\pi$ in
the absence of background selection. U and R refer to the diagram in part A.
U is the total mutation rate per diploid genome, summed across all genes in
the region; and R is the total rate of recombination across the region,
summed over each of the intervals between genes. The quantity hs measures
the degree of harmfulness of each deleterious mutation in a heterozygous
genotype; the extremes are hs = 0, when there is no effect in the
heterozygote, and hs = 1, when the heterozygote is lethal. The model on
which Equation 5.11 is based includes the assumption that hs is small but
not 0.

The curves in Figure 5.11B are for the specific value hs = 0.02, which
means that a genotype that is heterozygous for one deleterious mutation has
a 2% reduction in survival compared with a homozygous nonmutant. For
each curve, the relative nucleotide diversity ($\pi/\pi_0$) decreases as the total
recombination  rate  R decreases.  This  result  means  that,  with  tighter  linkage, 
each  detrimental  mutation  that  is  eliminated  takes  with  it  a larger  surround- 
ing region  of  chromosome.  The  relative  nucleotide  diversity  also  decreases  as 
the  total  mutation  rate  increases;  that  is,  greater  background  selection  elimi- 
nates a greater  number  of  chromosomes.  Together,  tight  linkage  and  a mod- 
erate or  high  total  mutation  rate  can  result  in  a very  substantial  decrease  in 
relative  nucleotide  diversity,  reducing  it  to  a level  of  20%  or  less  of  that 
expected  in  the  absence  of  background  selection.  In  view  of  the  reduction  in 
genetic  variation  in  regions  of  reduced  recombination  observed  in  Drosophila 
(Figure  5.10),  the  implication  of  Equation  5.11,  along  with  the  absence  of  a 
skewed  distribution  toward  rare  variants,  suggests  that  much  of  the  effect 
results  from  background  selection. 
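
The behavior described in this paragraph can be reproduced numerically. The sketch below reads Equation 5.11 as $\pi/\pi_0 = \exp[-U/(2hs + R)]$ and uses hs = 0.02 as in the text; the particular values of U and R are hypothetical and are chosen only to show the trends.

```python
import math

def relative_diversity(u, r, hs):
    """pi / pi_0 = exp(-U / (2*hs + R)), the assumed reading of Equation 5.11."""
    return math.exp(-u / (2 * hs + r))

HS = 0.02                              # heterozygous effect used for the curves
TOTAL_MUTATION = (0.01, 0.05, 0.10)    # hypothetical values of U
TOTAL_RECOMB = (0.0, 0.01, 0.05, 0.20) # hypothetical values of R

for u in TOTAL_MUTATION:
    row = [relative_diversity(u, r, HS) for r in TOTAL_RECOMB]
    print(f"U = {u:.2f}: " + "  ".join(f"{x:.3f}" for x in row))
```

Reading across a row, diversity rises with the total recombination rate; reading down a column, it falls as the total mutation rate increases. With no recombination and U = 0.10, the relative diversity is below 0.1, consistent with the statement that it can fall to 20% or less.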

Piecewise  Recombination  in  Bacteria 

Many  prokaryotic  organisms  make  use  of  mechanisms  of  recombination  in 
which  a piece  of  DNA  that  is  small,  relative  to  the  size  of  the  entire  genome, 
is  transferred  from  a donor  cell  into  a recipient  cell.  These  mechanisms 
include  transformation,  in  which  free  DNA  is  taken  up  by  the  recipient  from 
the  surrounding  medium;  transduction,  in  which  a DNA  fragment  is  carried 
from  the  donor  to  the  recipient  by  means  of  a virus  particle;  and  conjugation, 
in  which  a replica  of  the  chromosome  from  a donor  cell  is  transferred  into  a 
recipient  cell  by  a gradual  process  requiring  cell-to-cell  contact,  but  the  chro- 
mosome usually  breaks  before  the  transfer  is  complete.  Because  relatively 
short  patches  of  the  genome  participate  in  recombination,  these  processes 
differ  in  their  evolutionary  implications  from  meiotic  recombination  in 
eukaryotes. 

The  main  effect  of  short-patch  recombination  is  that  long-range  linkage 
disequilibrium  tends  to  be  maintained.  For  example,  in  enteric  bacteria,  such 
as  Escherichia  coli,  which  are  part  of  the  normal  intestinal  flora,  linkage  dise- 
quilibrium between  allozyme  loci  is  very  strong  (Whittam  et  al.  1983).  At  the 
level  of  DNA  sequence,  however,  many  genes  have  an  obviously  mosaic 
structure  in  which  different  segments  have  different  phylogenetic  histories 
(DuBose  et  al.  1988).  An  example  from  the  phoA  gene,  coding  for  alkaline 
phosphatase  in  E.  coli,  is  illustrated  in  Figure  5.12.  Among  the  polymorphic 
nucleotide  sites  indicated,  the  unique  nucleotide  at  each  site  is  inscribed  in  a 
box.  At  the  extreme  ends  of  the  gene,  the  alleles  from  strains  RM217T  and 




Figure 5.12 Evidence for recombination in the phoA gene in natural isolates of
E. coli. (Axis label: nucleotide site in phoA gene.) The pair of strains at the
top are more similar at the beginning and end of the gene; the pair of strains
at the bottom are more similar in the central region. There is significant
clustering of the nucleotide sites inscribed in boxes, as expected from
recombination. (Data from DuBose et al. 1988.)


RM45E are the most closely related; in the middle of the gene, from nucleotide
sites 1425 to 1560, there is a run of polymorphic nucleotides in which the
similarity  between  RM217T  and  RM45E  is  lost,  as  if  this  part  of  the  gene  had 
been  introduced  by  recombination  with  a more  distantly  related  allele. 
Although  short  runs  of  similar  or  dissimilar  nucleotides  can  also  be  the  result 
of  chance,  chance  effects  can  be  ruled  out  by  appropriate  statistical  tests  for 
recombination  (Stephens  1985;  Sawyer  1989). 

The finding that many genes have a mosaic ancestry through recombination
seems at first to contradict the finding of significant linkage disequilibrium
between more widely separated genes. The paradox is resolved by the
fact that each recombination event is local; it replaces a relatively short stretch
of the recipient chromosome, and the linkage phase between more distant
alleles is maintained. The E. coli chromosome, therefore, consists of clonal
segments from a common ancestor, which is called the clonal frame (Milkman
and Bridges 1990, 1993), interrupted by short segments derived from
recombination with diverse other clones. Even though the clonal frames are
interrupted by relatively short recombinant segments, their integrity would
ultimately be lost unless there were occasional selective events favoring
particular genotypes.

Absence  of  Recombination  in  Animal  Mitochondrial  DNA 

Studies in animal population genetics often focus on the DNA of mitochondria.
The mitochondrial genome is informative about parentage because, in
most species of animals, it is maternally inherited and does not undergo
recombination. It is also a small molecule present in abundant quantities in
most cells. In animals, mitochondrial DNA (mtDNA) is a circular molecule
typically in the range from 15 to 20 thousand base pairs in length. It codes
for fewer than 40 genes; approximately half code for ribosomal RNA or for
transfer RNA used in mitochondrial protein synthesis, and the remaining
genes code for proteins used in electron transport or oxidative phosphorylation.
In many species, including mammals, parts of the mtDNA sequence
evolve very rapidly in comparison with nuclear genes, and hence mtDNA
can often be used to make inferences about population structure and recent
population history.

An  example  of  the  utility  of  mtDNA  in  population  studies  is  illustrated  in 
Figure  5.13,  which  summarizes  the  result  of  examining  the  mtDNA  of  87 
pocket  gophers,  Geomys  pinetis,  collected  across  the  geographic  range  of  the 
species  in  Alabama,  Georgia,  and  Florida  (Avise  et  al.  1979).  The  mtDNA 


Figure 5.13 Lineage relationships between mtDNA types in pocket gophers.
The lowercase letters are different mtDNA types grouped according to similarity
and superimposed on a geographical map of the collection sites. The tick
marks across the connecting lines are the numbers of inferred mutational steps.
(From Avise 1994.)

from each gopher was digested in turn with each of six restriction enzymes,
each  cleaving  the  DNA  at  a different  six-base  recognition  site.  The  resulting 
restriction  fragments  were  separated  by  electrophoresis  and  compared 
among  the  animals  to  estimate  the  number  of  nucleotide  differences  affecting 
the  restriction  sites. 

Among the 87 gophers, there were 23 distinct types of mtDNA, represented
by the lowercase letters in Figure 5.13. Each of these types represents
a maternal mtDNA lineage, distinct from other lineages. Animals that share
an mtDNA type must have a female ancestor in common. The branching network
in Figure 5.13 estimates the matriarchal phylogeny of the mtDNA. The
straight  lines  connect  related  types  of  mtDNA,  and  the  number  of  slashes 
across  each  line  indicates  the  estimated  number  of  nucleotide  differences  in 
the  restriction  sites  between  the  mtDNA  types.  Groups  of  related  mtDNA 
types  are  enclosed  in  thin  black  lines;  the  thickest  lines  delineate  a western 
and  an  eastern  subpopulation  of  gophers  whose  overall  mtDNA  sequence 
differs  by  an  estimated  3%.  Between  the  eastern  and  western  subpopulations, 
there  are  9 nucleotide  differences  among  the  sites  cleaved  by  the  restriction 
enzymes. 

The  mtDNA  network  in  Figure  5.13  also  resolves  population  subdivision 
within  the  western  and  eastern  subpopulations.  This  subdivision  is  indicated 
by  the  mtDNA  types  circumscribed  by  the  thin  black  lines.  Some  of  the 
mtDNA types such as "k" and "p" are widespread, whereas others such as
"b" and "q" are more local in their distribution. The local clones usually differ
from the most widespread mtDNA type in the region by only one or two
nucleotides among the sites cleaved by the restriction enzymes. The example
in  Figure  5.13  shows  that,  because  of  matrilineal  inheritance  and  the  absence 
of  recombination  in  mtDNA,  the  network  of  mtDNA  types  can  reveal  a great 
deal  about  population  substructure  in  natural  populations. 

MIGRATION 

In a subdivided population, random genetic drift results in genetic divergence
among subpopulations. Migration, which refers to the movement of
organisms among subpopulations, is a sort of genetic glue that holds
subpopulations together genetically and that sets a limit to how much genetic
divergence can take place. To understand the homogenizing effects of migration,
it is useful to study migration in several simple models of population
structure.

One-Way  Migration 

When migration takes place predominantly from one population into another,
without an equal amount of migration in the reverse direction, then there
is said to be one-way migration. An illustration of one-way migration

Figure 5.14 Model of one-way migration from a large land mass (the mainland)
onto an island. The allele frequencies in the source population, $p^*$ and $q^*$,
are assumed to remain constant, whereas those in the recipient population,
$p_t$ and $q_t$, change with time.


between a large mainland population and a small island subpopulation is
shown in Figure 5.14. For simplicity, we consider a gene with two alleles, A
and a, with respective frequencies $p^*$ and $q^*$ on the mainland and $p$ and $q$ on
the island. Suppose that, in any generation, a proportion m of zygotes in the
island subpopulation originates as a random sample of organisms from the
mainland. Then, if $p$ and $p'$ are the frequencies of A in the island subpopulation
in two successive generations, it follows that


$$ p' = (1 - m)p + mp^* \tag{5.12} $$


In Equation 5.12, m is called the migration rate between the mainland
and the island. Subtracting $p^*$ from both sides of Equation 5.12 and simplifying
leads to the expression $p' - p^* = (1 - m)(p - p^*)$; from this expression it
follows immediately that $p_t - p^* = (1 - m)^t (p_0 - p^*)$, where $p_t$ is the
frequency of A in the island subpopulation in generation t. Hence,


$$ p_t = p^* + (1 - m)^t (p_0 - p^*) \tag{5.13} $$


Equation 5.13 expresses mathematically what should be clear intuitively:
With one-way migration, the allele frequency of A in the island subpopulation
gradually approaches that of the mainland population, and the rate of
approach is m per generation. As a check on Equation 5.13, note that, when
t = 0, then $p_t = p_0$, as must be the case, and as t becomes large, $p_t \to p^*$.
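
The relationship between the one-generation recursion (Equation 5.12) and its solution (Equation 5.13) can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names and the parameter values in the demonstration are illustrative assumptions.

```python
def next_frequency(p, m, p_star):
    """One generation of one-way migration (Equation 5.12)."""
    return (1 - m) * p + m * p_star


def iterate(t, p0, m, p_star):
    """Apply the one-generation recursion t times, starting from p0."""
    p = p0
    for _ in range(t):
        p = next_frequency(p, m, p_star)
    return p


def frequency_after(t, p0, m, p_star):
    """Closed-form island frequency after t generations (Equation 5.13)."""
    return p_star + (1 - m) ** t * (p0 - p_star)


if __name__ == "__main__":
    # Illustrative values only: island fixed for A, mainland lacking A.
    p0, m, p_star = 1.0, 0.01, 0.0
    for t in (0, 10, 100, 500):
        print(t, iterate(t, p0, m, p_star), frequency_after(t, p0, m, p_star))
```

The two columns printed for each t should agree to within floating-point rounding, because Equation 5.13 is simply the recursion in Equation 5.12 applied t times.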

As an evolutionary process that brings potentially new alleles into a population,
migration is qualitatively similar to mutation. The major difference is
quantitative: Generally speaking, the rate of migration among subpopulations
of a species is vastly greater than the rate of mutation of a gene. The
contrast is illustrated in Figure 5.15 for the unrealistic case in which the A

Figure 5.15 Change of allele frequency with one-way migration assuming
that an allele A is initially fixed in the recipient population and absent in the
source population. The migration rate is m = 0.01. Note that this is the same
curve as in Figure 5.1 except that the horizontal axis is compressed to 500
generations. The time scale is different because, generally speaking, the migration
rate m is much larger than the mutation rate $\mu$.


allele present in an island subpopulation is absent on the mainland. In this
case, Equation 5.13 becomes $p_t = p_0(1 - m)^t$, which has the same form as
Equation 5.1 for one-way mutation except that m replaces $\mu$. The identity in the
shape of the curves is apparent, but the time axis in Figure 5.15 is compressed
because, when m = 0.01, as in this example, compared with the value of
$\mu = 0.0001$ in Figure 5.1, it requires only one generation of migration to change
the allele frequency to the same extent as 100 generations of mutation.

Equation 5.13 holds more generally for one-way migration by letting $p$ be
the frequency of any allele in the population that receives the migrants and $p^*$
be the frequency of the same allele in the population that supplies the
migrants. Application of this equation to estimating the amount of genetic
migration in certain human populations makes use of the allele-frequency
data given in Problem 4.4 (page 126). The data pertain to blacks and whites in
Claxton, Georgia, and blacks in West Africa. The case of the MN blood
groups serves as an example. In West Africa, which for the purpose of
this problem may be regarded as the ancestral black population,
$p_0 = 0.474$ for the allele frequency of M. In present-day Claxton blacks,
$p_t = 0.484$. The Claxton white population may reasonably be regarded as
representative of the source of the migrants, and for Claxton whites, $p^* = 0.507$.
Blacks came into the United States on a large scale from West Africa about
300 years ago, hence t is about 10 generations. Substituting these estimates
into Equation 5.13, we obtain $0.484 = 0.507 + (1 - m)^{10}(0.474 - 0.507)$, that is,
$(1 - m)^{10} = 0.023/0.033 \approx 0.697$, from which we infer that m = 0.035 per
generation. This estimate can be interpreted as implying that, in the genetic
history of the population of Claxton blacks, about 3.5% of the alleles of the
MN gene in any generation were newly introduced by genetic migration from
whites. The apparent amount of migration estimated by this method differs
from one locus to the next. It also differs according to the geographical region
in which the white and black populations reside.
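
The estimate of m can be reproduced by solving Equation 5.13 for m. The sketch below is an illustration rather than part of the original text; the function name and its error handling are assumptions, and the only data used are the MN blood group values quoted above.

```python
def migration_rate(p0, pt, p_star, t):
    """Solve Equation 5.13 for m.

    Rearranging p_t = p* + (1 - m)^t (p_0 - p*) gives
    m = 1 - [(p_t - p*) / (p_0 - p*)]^(1/t).
    """
    ratio = (pt - p_star) / (p0 - p_star)
    if ratio < 0:
        # p_t lies beyond p*, which the one-way model cannot produce.
        raise ValueError("p_t lies on the other side of p* from p_0")
    return 1 - ratio ** (1 / t)


if __name__ == "__main__":
    # MN blood group values quoted in the text: p0, pt, p*, and t = 10.
    m = migration_rate(p0=0.474, pt=0.484, p_star=0.507, t=10)
    print(round(m, 3))  # 0.035
```

When the ratio exceeds 1, meaning the present-day frequency is farther from $p^*$ than the ancestral frequency was, the function returns a negative m. This is one way in which negative estimates, which are not consistent with the model, can arise.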


PROBLEM 5.4 Estimate the amount of migration from whites to
blacks using allele frequencies for each of the other genes in Problem
4.4 (page 126).


ANSWER Ss blood group: m = -0.013 per generation; Duffy: m =
0.011; Kidd: m = -0.028; Kell: m = -0.005; G6PD: m = 0.039;
hemoglobin β: m = 0.071.


Problem 5.4 illustrates some of the difficulties in estimating racial admixture
from allele frequencies. The positive values of m vary widely, and the
negative  values  are  not  consistent  with  the  proposed  model  of  migration. 
Cavalli-Sforza  and  Bodmer  (1971)  remark  that  "The  weakness  of  the  analysis 
is  mostly  due  to  the  uncertainty  of  the  origin  of  black  Americans  . . . and  the 
variability  of  gene  frequencies  in  the  probable  area  of  the  slave  markets  in 
West  Africa.  In  addition,  it  is  unavoidable  that  gene  frequencies  have 
changed  somewhat  from  their  original  values,  due  to  drift  or,  in  some  cases, 
selection.  The  opportunities  for  admixture,  and  the  time  available  for  it,  must 
also  have  varied  widely."  The  most  reliable  gene  among  those  in  Problem  5.4 
is probably that for the Duffy blood groups because the $Fy^a$ allele is virtually
nonexistent  in  all  of  West  Africa.  For  this  gene,  the  estimate  of  m is  about  one 
percent  per  generation,  a result  that  is  consistent  with  the  average  value  for  a 
large  number  of  other  genes  (Cavalli-Sforza  and  Bodmer  1971). 

The  Island  Model  of  Migration 

In the island model of migration, a large population is split into many
subpopulations dispersed geographically like islands in an archipelago.
Examples of island population structure might include fish in freshwater
lakes or slugs in dispersed garden plots. Each subpopulation is assumed to
be so large that random genetic drift can be neglected. Consider an allele A
with an average allele frequency among the subpopulations equal to $\bar{p}$.
Migration is assumed to happen in such a way that the allele frequency
among the migrants equals the average allele frequency among the
subpopulations, namely, $\bar{p}$. The amount of migration is again measured by the
parameter m, which equals the probability that a randomly chosen allele in any
subpopulation comes from a migrant. Let us consider a particular subpopulation
with an A allele frequency of $p_t$ in generation t. For a randomly chosen
allele in this subpopulation in generation t, the allele could have come from
the same subpopulation in generation $t - 1$ with probability $1 - m$, in which
case it is an A allele with probability $p_{t-1}$. Alternatively, the allele could have
come from a migrant in generation $t - 1$ with probability m, in which case it
is an A allele with probability $\bar{p}$. Because all evolutionary processes other
than migration are ignored, $\bar{p}$ stays the same in all generations. Altogether,

$$ p_t = p_{t-1}(1 - m) + \bar{p}m \tag{5.14} $$

Equation 5.14 is similar to Equation 5.2 for mutation, and its solution in
terms of $p_0$ is

$$ p_t = \bar{p} + (1 - m)^t (p_0 - \bar{p}) \tag{5.15} $$

The similarity with Equation 5.13 is apparent: in fact, the equations are
identical except that the role of $p^*$ in one-way migration is replaced with $\bar{p}$ in
the island model. Perhaps less obvious is the similarity with Equation 5.4 for
reversible mutation, in which case $\nu/(\mu + \nu)$ plays the role of $\bar{p}$ and $\mu + \nu$
plays the role of m. The correspondence between the equations again
emphasizes the similarity between the effects of migration and those of mutation.
The processes result in similar mathematical expressions because both mutation
and migration act linearly on allele frequency, which means that $p_t$ is a
linear function of $p_{t-1}$. Although Equation 5.15 for migration is mathematically
similar to Equation 5.4 for mutation, the biological implications are quite
different. Because rates of migration are typically much greater than rates of
mutation, changes in allele frequency are generally much faster with migration.

As an example of the use of Equation 5.15, suppose there are only two
populations with initial allele frequencies of A of 0.2 and 0.8, respectively,
with m = 0.10. Thus 10 percent of the organisms in either subpopulation in
any generation are migrants having an allele frequency of A of
$\bar{p} = (0.2 + 0.8)/2 = 0.5$. What is the allele frequency of A in the two populations
after 10 generations? For the population with initial allele frequency 0.2, we
substitute $p_0 = 0.2$, $\bar{p} = 0.5$, and $m = 0.10$ into Equation 5.15 to obtain
$p_{10} = 0.5 + (1 - 0.10)^{10}(0.2 - 0.5) = 0.395$; for the other population, we substitute
$p_0 = 0.8$, $\bar{p} = 0.5$, and $m = 0.10$, and so
$p_{10} = 0.5 + (1 - 0.10)^{10}(0.8 - 0.5) = 0.605$.

Figure 5.16 Change of allele frequency with time in five subpopulations
exchanging migrants at the rate m = 0.1 per generation. Note the rapid
convergence to a common equilibrium frequency.


Another  example  using  Equation  5.15  is  shown  in  Figure  5.16,  where  there 
are  five  subpopulations  (initial  frequencies  1,  0.75,  0.50,  0.25,  and  0),  again 
with  m = 0.10.  Note  how  rapidly  the  allele  frequencies  converge  to  the  same 
value,  in  this  case,  0.5. 
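
Both examples can be reproduced with a few lines of Python. This is a sketch under the assumption that all subpopulations are the same size, so that the average allele frequency is the simple mean of the subpopulation frequencies (as in the two-population calculation above); the function names are illustrative.

```python
def island_frequency(p0, p_bar, m, t):
    """Allele frequency in one subpopulation after t generations (Equation 5.15)."""
    return p_bar + (1 - m) ** t * (p0 - p_bar)


def island_trajectories(initial, m, generations):
    """Trajectory of every subpopulation, assuming equal subpopulation sizes."""
    p_bar = sum(initial) / len(initial)  # average frequency, constant over time
    return [
        [island_frequency(p0, p_bar, m, t) for t in range(generations + 1)]
        for p0 in initial
    ]


if __name__ == "__main__":
    # Two-population example from the text: m = 0.10, ten generations.
    for p0 in (0.2, 0.8):
        print(p0, round(island_frequency(p0, 0.5, 0.10, 10), 3))

    # Five-subpopulation example corresponding to Figure 5.16.
    for traj in island_trajectories([1.0, 0.75, 0.50, 0.25, 0.0], 0.10, 50):
        print([round(p, 3) for p in traj[::10]])
```

The first loop gives the values 0.395 and 0.605 obtained above, and the trajectories of the five subpopulations all approach 0.5.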

How  Migration  Limits  Genetic  Divergence 

It is remarkable how little migration is required to prevent significant genetic
divergence among subpopulations as measured by, for example, the fixation
index $F_{ST}$. To understand the homogenizing effect of migration, consider
the model in Figure 5.5 (page 171), in which two alleles drawn at random
from a subpopulation in generation t are replicas of the same allele in generation
$t - 1$ with probability $1/(2N)$ and replicas of different alleles in generation
$t - 1$ with probability $1 - 1/(2N)$. In the first case, the alleles are necessarily
identical by descent; in the second case, they are identical by descent with
probability $F_{t-1}$, where $F$ is shorthand for $F_{ST}$. In either case, the identity by
descent is unbroken only if neither allele is replaced by an allele from a
migrant, and so

$$ F_t = \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1}\right](1 - m)^2 \tag{5.16} $$


Illustrating again the analogy between migration and mutation, Equation
5.16 is identical to Equation 5.8 measuring the effect of mutation on the
probability of identity by descent, except that m replaces $\mu$. The equilibrium
value $\hat{F}$ of $F$ can be found by setting $\hat{F} = F_t = F_{t-1}$; after expanding the squared
terms on the right-hand side, and assuming that m is small enough, and N
large enough, that terms in $m^2$ and $m/N$ can be ignored, some rearrangement
leads to

$$ \hat{F} = \frac{1}{1 + 4Nm} \tag{5.17} $$

As might be expected, Equation 5.17 is identical in form to Equation 5.9
for mutation, but the biological implications are very different because
the rate of migration is typically much greater than the rate of mutation.
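
The rearrangement leading to Equation 5.17 can be spelled out as follows. This is a sketch of the algebra, assuming the recursion implied by the verbal argument above: two alleles are identical by descent with probability $1/(2N) + [1 - 1/(2N)]F_{t-1}$, and both escape replacement by migrant alleles with probability $(1 - m)^2$. The symbol $\hat{F}$ denotes the equilibrium value.

```latex
\begin{align*}
\hat{F} &= \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)\hat{F}\right](1 - m)^2 \\
        &\approx \left[\frac{1}{2N} + \hat{F} - \frac{\hat{F}}{2N}\right](1 - 2m)
           && \text{(dropping the term in } m^2\text{)} \\
        &\approx \frac{1}{2N} + \hat{F} - \frac{\hat{F}}{2N} - 2m\hat{F}
           && \text{(dropping terms in } m/N\text{)} \\
\hat{F}\left(\frac{1}{2N} + 2m\right) &\approx \frac{1}{2N} \\
\hat{F} &\approx \frac{1}{1 + 4Nm}
\end{align*}
```

The last step multiplies the numerator and denominator by $2N$.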

The product Nm in Equation 5.17 has a straightforward biological
interpretation. The total number of alleles in a subpopulation of size N diploid
organisms is 2N. In any generation, the proportion of alleles that are replaced
by alleles from migrant organisms is m; hence the number of migrant alleles
in any generation equals 2Nm. However, 2Nm is also the total number of alleles
in Nm diploid organisms, and so Nm can be interpreted as the absolute
number of migrant organisms that come into each subpopulation in each
generation.

Because  the  absolute  number  of  migrants  per  generation  equals  Nm, 
Equation  5.17  implies  that  F decreases  as  the  number  of  migrants  increases. 
Indeed,  the  decrease  in  F with  increasing  Nm  is  extremely  rapid,  as  shown  in 
Figure  5.17.  In  the  extreme  case  of  complete  genetic  isolation  between  the 
subpopulations,  Nm  = 0 and  F = 1.  The  decrease  is  then  so  rapid  that  for: 

• Nm  = 0.25  (one  migrant  every  fourth  generation),  F = 0.50 

• Nm = 0.5 (one migrant every second generation), F = 0.33

• Nm = 1 (one migrant every generation), F = 0.20

• Nm = 2 (two migrants every generation), F = 0.11
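
The values in the list above follow directly from Equation 5.17, and the approximation itself can be checked by iterating the underlying recursion. The sketch below is illustrative only: the function names, the starting value of F, and the choice N = 1000 with m = 0.001 are assumptions, and the recursion is written as implied by the verbal argument (identity by descent with probability 1/(2N) + [1 - 1/(2N)]F in the previous generation, with neither allele replaced by a migrant).

```python
def equilibrium_f(nm):
    """Approximate equilibrium fixation index as a function of Nm (Equation 5.17)."""
    return 1 / (1 + 4 * nm)


def iterate_f(n, m, generations, f0=0.0):
    """Iterate F_t = [1/(2N) + (1 - 1/(2N)) F_{t-1}] (1 - m)^2 from F_0 = f0."""
    f = f0
    for _ in range(generations):
        f = (1 / (2 * n) + (1 - 1 / (2 * n)) * f) * (1 - m) ** 2
    return f


if __name__ == "__main__":
    for nm in (0, 0.25, 0.5, 1, 2):
        print(nm, round(equilibrium_f(nm), 2))

    # Hypothetical subpopulation: N = 1000 and m = 0.001, so Nm = 1.
    print(iterate_f(1000, 0.001, 10000), equilibrium_f(1))
```

For the hypothetical subpopulation, the iterated value settles very close to, though not exactly at, the approximate value of 0.20, because Equation 5.17 ignores terms in $m^2$ and $m/N$.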

The implication of Figure 5.17 is that migration is a potent force acting
against genetic divergence among subpopulations. On the other hand, the
homogenizing effect of migration should not be overestimated. The measure
of genetic divergence in Figure 5.17 is $F_{ST}$, the value of which is determined
by the variance in allele frequency among subpopulations (Equation 4.6) and
so is affected primarily by polymorphic alleles that are at intermediate
frequencies. Rare alleles present in one subpopulation but absent in others have
hardly any effect on $F_{ST}$. Because rare alleles are rare, they are unlikely to be
included among migrant organisms unless the migration rate is very great,
and so rare alleles will tend to remain present in only one or a few
subpopulations in a local area until such time as their frequency may become great
enough to be dispersed by migration. An allele found in only one subpopulation
is called a private allele. Next we shall see that the rate of migration
can be estimated by an examination of the frequency of private alleles.

Figure 5.17 Decrease in the fixation index $F_{ST}$ among subpopulations at
equilibrium in the island model of migration. The curve is that in Equation 5.17
giving F as a function of Nm (axis label: number of migrant organisms per
generation). In the island model, Nm is the number of migrant
organisms that come into each subpopulation in each generation.


Estimates  of  Migration  Rates 

One method of estimating genetic migration in natural populations relies on
the finding that, in theoretical models, the logarithm of Nm decreases
approximately as a linear function of the average frequency of private alleles
in samples from the subpopulations (Slatkin 1985). Data on the average
frequency of private alleles have been compiled and analyzed by Slatkin (1985),
and the resulting estimates of Nm and equilibrium values of $F_{ST}$ are
summarized in Table 5.1. There is obviously considerable variation in Nm among
organisms. However, many of the values of Nm are smaller than about 2,
which means that there is still considerable opportunity for genetic
divergence among subpopulations.

A second kind of approach to estimating Nm in natural populations is
illustrated in Figure 5.18, which gives the distribution of estimated values of
$F_{ST}$ among 61 genes in natural populations of Drosophila melanogaster (Singh
and Rhomberg 1987). The average of the estimated values is $F_{ST} = 0.16$,
which, assuming equilibrium, is an estimate of $1/(1 + 4Nm)$ (Equation 5.17). The
estimate is therefore $Nm = [(1/0.16) - 1]/4 = 1.3$. This estimate is within the
range for other Drosophila species in Table 5.1. However, there are many
genes in Figure 5.18 that have $F_{ST}$ values greater than 0.30. An analogous
method of estimating Nm from the $F_{ST}$ values of polymorphic nucleotides
within a gene is discussed in Hudson et al. (1994a). In Chapter 7 we will
consider how Nm can be estimated from the genealogies of genes.
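
Inverting Equation 5.17 to obtain Nm from an observed $F_{ST}$ is a one-line calculation. The following sketch is illustrative; the function name is an assumption, and the three (Nm, $F_{ST}$) pairs used as a consistency check are copied from Table 5.1.

```python
def nm_from_fst(fst):
    """Invert Equation 5.17: Nm = (1/F_ST - 1) / 4, assuming equilibrium."""
    if not 0 < fst <= 1:
        raise ValueError("F_ST must be greater than 0 and at most 1")
    return (1 / fst - 1) / 4


if __name__ == "__main__":
    # Average F_ST for Drosophila melanogaster quoted in the text.
    print(round(nm_from_fst(0.16), 1))  # 1.3

    # Three (Nm, F_ST) pairs taken from Table 5.1, as a consistency check.
    for nm, fst in ((42.0, 0.006), (1.0, 0.200), (0.10, 0.714)):
        print(nm, fst, round(1 / (1 + 4 * nm), 3))
```

The estimate is only as good as the assumptions behind Equation 5.17, namely the island model of migration and equilibrium between migration and drift.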

Patterns  of  Migration 

Migration  in  actual  populations  is  more  complex  than  is  assumed  in  the 
island model of migration. In nature, migrants come primarily from nearby

TABLE 5.1 ESTIMATES OF Nm AND F_ST

Species                       Type of organism  Estimated Nm  Estimated F_ST
Stephanomeria exigua          Annual plant      1.4           0.152
Mytilus edulis                Mollusc           42.0          0.006
Drosophila willistoni         Insect            9.9           0.025
Drosophila pseudoobscura      Insect            1.0           0.200
Chanos chanos                 Fish              4.2           0.056
Hyla regilla                  Frog              1.4           0.152
Plethodon ouachitae           Salamander        2.1           0.106
Plethodon cinereus            Salamander        0.22          0.532
Plethodon dorsalis            Salamander        0.10          0.714
Batrachoseps pacifica ssp. 1  Salamander        0.64          0.281
Batrachoseps pacifica ssp. 2  Salamander        0.20          0.556
Batrachoseps campi            Salamander        0.16          0.610
Lacerta melisellensis         Lizard            1.9           0.116
Peromyscus californicus       Mouse             2.2           0.102
Peromyscus polionotus         Mouse             0.31          0.446
Thomomys bottae               Gopher            0.86          0.225

Source: Data from Slatkin 1985.

Figure 5.18 Distribution of estimated values of $F_{ST}$ for 61 genes among natural
populations of Drosophila melanogaster. Although the average value of $F_{ST}$
suggests migration at a level of Nm between 1 and 2, about one-third of the genes
have $F_{ST}$ values greater than 0.20. (From Singh and Rhomberg 1987.)

populations. To the extent that nearby populations have similar allele frequencies,
the effects of migration are smaller, and sometimes much smaller, than
predicted by the island model. Populations in nature may be strung out along one
dimension, such as a river bank. Populations may also be distributed regularly
in two dimensions, or there may be one large population with an internal
genetic structure caused by the tendency for mating to take place between
organisms born in the same region. Analysis of the effects of migration in such
complex population structures is usually very difficult. Among humans,
migration rates depend on age, sex, marital status, socioeconomic status, population
density, and many other factors. Migration rates also can change rapidly, and
so a full-blown theory of migration has to be extremely complex.

The effects of migration on genetic differentiation of populations are seen
dramatically in Figure 5.19. Part A pertains to the moth Biston betularia, part B
to the moth Gonodontis bidentata. Both species have evolved melanic
(blackened) forms in response to heavy air pollution, and the graphs give the
frequency of the melanic forms in the two species. The geographical area in A
includes Liverpool and Manchester, as viewed from rural Wales. Note the
fall-off in frequency of melanics in the nonindustrial areas toward the front of
the graph. Biston betularia exists in low population densities and must fly
relatively long distances to find a mate. The resulting high rate of migration
hinders differentiation of populations, hence the smooth surface. In contrast,
Gonodontis bidentata exists in high population densities and the migration rate
is low; hence there is substantial genetic differentiation among populations,
as evidenced by the bumpy surface of the graph in part B.

TRANSPOSABLE  ELEMENTS 

A DNA  sequence  that  can  change  its  location  within  the  genome  is  called  a 
transposable  element.  In  being  able  to  create  novel  genome  rearrangements, 
transposable  elements  are  agents  of  genetic  variation.  A transposable  ele- 
ment may  insert  into  a coding  region  and  inactivate  a gene  or  insert  into  a 
regulatory  region  and  change  the  pattern  of  expression  of  the  gene.  Also, 
pairs  of  transposable  elements  may  undergo  recombination  and  create  novel 
chromosome  rearrangements. 

The  process  of  transposition  requires  a protein,  called  transposase,  which 
is  usually  encoded  within  the  sequence  of  the  transposable  element  itself. 
Most  transposable  elements  undergo  transposition  through  a replicative 
process  with  DNA  or  RNA  intermediates.  In  most  cases,  transposition  to  a 
new  location  also  leaves  one  copy  of  the  transposable  element  behind  in  its 
original  location,  so  transposable  elements  can  increase  in  copy  number  in 
the  genome.  Some  transposable  elements  are  also  able  to  regulate  their  own 
rate  of  transposition.  Several  major  classes  of  transposable  elements  can  be 
distinguished  by  their  nucleotide  sequence  organization  or  by  the  details  of 
their mechanisms of transposition or regulation.

Figure 5.19 (A) Distribution of melanic moths of the species Biston betularia over
an area including Liverpool and Manchester, as viewed from rural Wales. (B)
Distribution of melanic moths of the species Gonodontis bidentata over a smaller area
than in (A) but viewed from the same perspective. (From Bishop and Cook 1975.)

Factors Controlling the Population Dynamics of Transposable Elements

Transposable elements were originally discovered in maize as the cause of
certain genetically unstable mutations. They are now known to be ubiquitous
among prokaryotes and eukaryotes (Berg and Howe 1989). The ability
of transposable elements to increase in copy number and create novel
chromosomal rearrangements reveals a dynamic aspect of genome structure and
evolution not previously recognized. Some transposable elements have
become widely disseminated among organisms because of their ability to
undergo horizontal transmission between reproductively isolated genomes.
Often referred to as selfish DNA because transposition alone may be
sufficient for persistence in the genome of a species, transposable elements also
may occasionally create favorable mutations and thus become agents of
adaptive evolution.

Models  for  the  population  dynamics  of  transposable  elements  usually 
incorporate  several  features: 

• A rate  of  infection,  in  which  genomes  previously  lacking  the  transposable 
element  become  infected  with  it. 

• A rate  of  transposition,  which  determines  how  rapidly  the  copy  number 
increases;  the  effects  of  regulation  are  taken  into  account  by  assuming 
that  the  rate  of  transposition  is  a decreasing  function  of  copy  number. 

• A mechanism,  or  combination  of  mechanisms,  for  eliminating  elements 
from  the  population;  otherwise,  the  copy  number  would  increase  indefi- 
nitely. The  usual  assumption  is  that  the  presence  of  transposable  ele- 
ments in  the  genome  decreases  the  ability  of  an  organism  to  survive  and 
reproduce,  resulting  in  the  elimination  of  some  elements  by  means  of  nat- 
ural selection,  or  that  elements  can  be  eliminated  from  the  genome  by 
means  of  genetic  deletion. 

Through  the  study  of  such  models,  the  diversity  and  novel  attributes  of 
transposable  elements  have  been  incorporated  into  the  concepts  of  popula- 
tion genetics;  see,  for  example,  Langley  et  al.  (1983),  Montgomery  and  Lang- 
ley (1983),  Kaplan  and  Brookfield  (1983),  Sawyer  et  al.  (1987),  Hartl  and 
Sawyer  (1988),  Ajioka  and  Hartl  (1989),  Charlesworth  et  al.  (1994). 

Insertion  Sequences  and  Composite  Transposons  in  Bacteria 

Bacteria  contain  several  types  of  transposable  elements.  Among  the  simplest 
are  insertion  sequences,  which  are  typically  about  1000-2000  nucleotides  in 
length  and  contain  at  least  one  long  translational  open  reading-frame  coding 
for  the  transposase  protein.  The  transposase  recognizes  a short  nucleotide 
sequence,  inverted  in  orientation,  present  at  each  end  of  the  insertion 
sequence,  and  so  the  element  moves  as  an  intact  unit.  The  bacterium 
Escherichia  coli  contains  several  types  of  insertion  sequences,  each  different 
but  all  sharing  the  same  sequence  organization  with  inverted  repeats  and  at 
least one open reading frame. The factors controlling the population dynamics
of insertion sequences can be deduced from the distribution of numbers
of  each  element  present  among  a sample  of  bacterial  strains  isolated  from 
natural  sources  (Sawyer  et  al.  1987). 

Population  models  of  transposable  elements  in  E.  coli  are  greatly  simpli- 
fied because  the  organism  has  asexual  reproduction,  a low  rate  of  recombina- 
tion among  strains,  and  a low  rate  of  deletion  of  insertion  sequences.  The 
"state"  of  a bacterial  strain  with  respect  to  a particular  insertion  sequence 
may  be  defined  as  the  number  of  copies  n of  the  element  that  are  present. 
Among  the  factors  that  control  the  population  dynamics  are: 

• The  rate  u at  which  uninfected  cells  become  infected;  u is  the  probability, 
per  generation,  that  a cell  initially  in  state  n = 0 ends  up  in  the  state  n = 1. 

• The  rate  T of  transposition  in  infected  strains;  T is  the  probability,  per 
generation, that a cell in state n > 0 goes to state n + 1.

• The rate S at which reproduction of infected cells is less than that of uninfected
cells. In terms of the exponential growth model in Chapter 1, if $r_0$ is
the intrinsic rate of increase of uninfected cells (see Equation 1.7 on page
30) and $r_0'$ is that of infected cells, then $S = r_0 - r_0'$.

The  most  general  models  of  this  type  allow  for  T and  S to  be  functions  of 
n,  but  here  we  will  assume  that  they  are  constant.  Note,  however,  that  the 
assumption  that  T is  a constant  implicitly  defines  a type  of  regulation 
because,  if  the  probability  of  transition  from  state  n to  state  n + 1 is  indepen- 
dent of  n,  then  the  probability  of  transposition  per  element  present  in  a strain 
must  equal  T/n  and  this  fraction  is  a decreasing  function  of  n. 

Given constant values of u, T, and S, it can be shown that a population
of bacterial cells attains an equilibrium distribution of numbers of transposable
elements in which the probability $p_i$ that a cell contains exactly i
copies of the transposable element is equal to

$$p_0 = \alpha \tag{5.18a}$$

and

$$p_i = (1 - \alpha)(1 - \phi)\phi^{i-1} \quad (i \geq 1) \tag{5.18b}$$

where $\alpha = 1 - (u/S)$ and $\phi = T/(T + S - u)$ (Sawyer and Hartl 1986, Sawyer et
al. 1987).
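
As a minimal sketch of Equation 5.18, the following Python computes $\alpha$ and $\phi$ from the rates u, T, and S and tabulates the equilibrium probabilities. The function names and the numerical rates are illustrative assumptions (they are not estimates from data), and the sketch assumes u < S so that $\alpha$ lies between 0 and 1.

```python
def alpha_phi(u, T, S):
    """Return (alpha, phi) of Equation 5.18 from the rates u, T, and S."""
    alpha = 1.0 - u / S
    phi = T / (T + S - u)
    return alpha, phi


def copy_number_distribution(alpha, phi, max_copies):
    """Equilibrium probabilities p_0 .. p_max_copies from Equation 5.18."""
    probs = [alpha]  # Equation 5.18a: p_0 = alpha
    for i in range(1, max_copies + 1):
        # Equation 5.18b: p_i = (1 - alpha)(1 - phi) phi^(i - 1)
        probs.append((1.0 - alpha) * (1.0 - phi) * phi ** (i - 1))
    return probs


# Hypothetical rates, chosen only so that alpha = phi = 1/2.
u, T, S = 0.001, 0.001, 0.002
alpha, phi = alpha_phi(u, T, S)
probs = copy_number_distribution(alpha, phi, max_copies=5)

for i, p in enumerate(probs):
    print(i, round(p, 4))
```

Because the terms for i ≥ 1 form a geometric series, the probabilities over all copy numbers sum to 1; the listed terms fall short of 1 only by the tail $(1 - \alpha)\phi^{5}$ for cells with more than five copies.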

Equation 5.18 can be applied to the concrete case of insertion sequence
IS30 in E. coli, in which the distribution of numbers among 71 strains fits a
model with $\alpha = 1/2$ and $\phi = 1/2$. With these parameters, the distribution reduces
to the remarkably simple formula $p_i = (1/2)^{i+1}$ for $i \geq 0$. Among 71 strains,
therefore, the observed and expected numbers of strains containing i elements
are as indicated in Table 5.2. The strains with five or more elements
have been grouped in order to carry out a $\chi^2$ test of goodness of fit. This $\chi^2$
test has three degrees of freedom because $\alpha$ and $\phi$ were estimated from the




TABLE 5.2 NUMBER OF IS30 ELEMENTS PRESENT IN 71 NATURAL
ISOLATES OF E. coli

Number of copies     Expected number     Observed number
of IS30 element      of strains          of strains

0                    35.5                36
1                    17.8                16
2                     8.9                13
3                     4.4                 2
4                     2.2                 2
$\geq 5$              2.2                 2

Source: Data from Sawyer et al. 1987.


data. The value of $\chi^2$ equals 3.48, which has an associated probability level of
about 0.35. Thus, the simple model for IS30 fits the observed data very well.
Although the $\chi^2$ test cannot be completely trusted in this case because of the
small expected numbers in some of the categories, the conclusion is supported
by a more exact statistical test (Sawyer et al. 1987). The following problem
deals with the distribution of three other insertion sequences in E. coli.
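
The IS30 calculation can be retraced with a short Python sketch. It takes the observed counts from Table 5.2, builds the expected counts from Equation 5.18 with both parameters equal to 1/2, pools the classes with five or more copies, and sums the usual goodness-of-fit terms. The variable names are illustrative, and the probability level is deliberately left out because it depends on the table or approximation used.

```python
# Observed strains with 0, 1, 2, 3, 4, and >= 5 copies of IS30 (Table 5.2).
observed = [36, 16, 13, 2, 2, 2]
n_strains = sum(observed)  # 71 strains in all
alpha = phi = 0.5          # parameter values quoted in the text

# Equation 5.18 for 0..4 copies; the last class pools all i >= 5.
probs = [alpha] + [(1 - alpha) * (1 - phi) * phi ** (i - 1) for i in range(1, 5)]
probs.append(1.0 - sum(probs))

expected = [n_strains * p for p in probs]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: classes, minus 1, minus the two estimated parameters.
degrees_of_freedom = len(observed) - 1 - 2

print([round(e, 1) for e in expected])
print(round(chi_square, 2), degrees_of_freedom)
```

Most of the total comes from the classes with two and three copies, where the observed counts (13 and 2) depart most from the expected values.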


PROBLEM 5.5 The distribution of IS1 fits Equation 5.18 with
$\alpha = 1/5$ and $\phi = 5/6$; IS2 fits the equation with $\alpha = 2/5$ and $\phi = 2/3$; and IS4
fits with $\alpha = 2/3$ and $\phi = 3/4$. Calculate the expected numbers for 71
strains and carry out a $\chi^2$ test. (The observed numbers are from
Sawyer et al. 1987.)


No. copies      0      1      2      3      4     $\geq 5$

IS1            11     14      8      6      7     25
IS2            28      8     12      5      5     13
IS4            43      5      5      3      5     10

ANSWER For IS1, the expected distribution is given by $p_0 = 1/5$,
$p_i = (4/25)(5/6)^i$ for $1 \leq i \leq 4$, and $p_{\geq 5} = 1 - (p_0 + p_1 + p_2 + p_3 + p_4)$. For
IS2, the expected distribution is $p_0 = 2/5$ and $p_i = (3/10)(2/3)^i$ $(1 \leq i \leq 4)$.
For IS4, the expected distribution is $p_0 = 2/3$ and $p_i = (1/9)(3/4)^i$
$(1 \leq i \leq 4)$. Expected numbers, $\chi^2$ values, and associated probabilities
are:


No. copies      0      1      2      3      4     $\geq 5$    $\chi^2$    P value

IS1          14.2    9.5    7.9    6.6    5.5    27.4      3.58       0.35
IS2          28.4   14.2    9.5    6.3    4.2     8.4      6.31       0.10
IS4          47.3    5.9    4.4    3.3    2.5     7.5      4.00       0.28

As in the case of IS30, more exact statistical tests confirm the conclusion
that the model fits. However, the distribution of IS1 has a very
long tail, with nine strains containing from 15 to 20 copies and six
strains containing from 21 to 30 copies; this distribution is approximated
even more closely by a model in which the rate of transposition
per element decreases more gradually than T/n (Sawyer et al. 1987).
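
The expected numbers and $\chi^2$ values in the answer can be checked with a few lines of Python. This is only a sketch: the helper names `expected_counts` and `chi_square` are illustrative, the parameter values are those stated in the problem, and the classes with five or more copies are pooled as in the tables.

```python
def expected_counts(alpha, phi, n_strains, pooled_from=5):
    """Expected strains with 0 .. pooled_from-1 copies, plus one pooled class."""
    probs = [alpha] + [(1 - alpha) * (1 - phi) * phi ** (i - 1)
                       for i in range(1, pooled_from)]
    probs.append(1.0 - sum(probs))  # all copy numbers >= pooled_from
    return [n_strains * p for p in probs]


def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# (alpha, phi, observed counts for 0, 1, 2, 3, 4, and >= 5 copies)
elements = {
    "IS1": (1 / 5, 5 / 6, [11, 14, 8, 6, 7, 25]),
    "IS2": (2 / 5, 2 / 3, [28, 8, 12, 5, 5, 13]),
    "IS4": (2 / 3, 3 / 4, [43, 5, 5, 3, 5, 10]),
}

results = {}
for name, (alpha, phi, observed) in elements.items():
    expected = expected_counts(alpha, phi, sum(observed))
    results[name] = (expected, chi_square(observed, expected))
    print(name, [round(e, 1) for e in expected], round(results[name][1], 2))
```

Small differences in the last decimal place relative to the printed table can arise from rounding the expected numbers before computing the statistic.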


Apart  from  their  own  evolutionary  dynamics,  insertion  sequences  are 
important  because  they  can  mobilize  other  sequences  in  the  genome.  When 
two  copies  of  an  insertion  sequence  are  on  flanking  sides  of  an  unrelated 
sequence,  the  inverted  repeats  used  in  transposition  are  preferentially  those 
at  the  extreme  ends.  This  kind  of  insertion-sequence  sandwich  constitutes  a 
composite  transposable  element  or  transposon,  which  transposes  as  a single 
unit.  In  a composite  transposon,  the  central  sequence  can  include  one  or  more 
genes  that  confer  a selective  advantage  on  the  host  cell,  such  as  a gene  for 
resistance  to  an  antibiotic;  hence,  the  possession  of  the  transposon  would  be 
favored  in  an  environment  containing  the  antibiotic. 

Mobilization  of  genes  for  antibiotic  resistance,  heavy-metal  resistance,  and 
other  functions  is  one  of  the  principal  evolutionary  implications  of  transpos- 
able elements  in  bacteria.  Transposable  elements  enable  the  piecewise  assem- 
bly of  specialized,  infectious  molecules  called  plasmids.  Plasmids  are 
autonomously  replicating,  circular  molecules  of  DNA  that  exist  within  bacte- 
rial cells.  Many  plasmids  contain  genes  that  promote  their  transfer  between 
different  organisms.  They  may  also  contain  genes,  such  as  those  for  antibiotic 
resistance,  that  are  highly  advantageous  to  their  hosts  in  certain  environ- 
ments. These  genes  are  often  contained  in  transposons,  and  they  undoubted- 
ly entered  the  plasmid  through  transposition  from  a different  plasmid  or  from 
the  genome  of  a previous  host.  Infectious  plasmids  containing  multiple  antibi- 
otic-resistance genes  are  called  resistance  transfer  factors,  and  they  are  a 
major  source  of  multiple  drug  resistance  in  pathogenic  bacteria. 

Transposable Elements in Eukaryotes

Transposable  elements  can  have  important  genetic  consequences  as  muta- 
genic agents  by  the  creation  of  novel  genes,  by  alteration  of  the  expression  of 
genes  in  their  vicinity,  and  in  the  genesis  of  major  genomic  rearrangements. 
Transposable  elements  also  have  important  implications  in  population 
genetics  and  evolution.  Several  major  classes  of  transposable  elements  have 
been  identified  that  differ  in  the  molecular  mechanisms  of  transposition. 
Within  each  class,  the  members  can  also  differ  in  DNA  sequence.  Based  on 
similarity  in  DNA  sequence,  transposable  elements  typically  can  be  grouped 
hierarchically  into  "subfamilies,"  in  which  the  elements  resemble  each  other 
quite  closely;  "families,"  in  which  they  differ  from  one  another  somewhat 
more;  and  "superfamilies,"  in  which  the  differences  are  relatively  great. 
Transposable elements are widespread in both animals and plants. For
example, Drosophila melanogaster contains multiple copies of each of 50 to 100
different  families  of  transposable  elements  (Rubin  1983).  Although  few  of 
these  elements  have  been  studied  in  detail  from  the  standpoint  of  population 
genetics,  indirect  evidence  suggests  that  most  of  the  elements,  like  insertion 
sequences  in  bacteria,  are  mildly  harmful  to  the  host  (Golding  et  al.  1986, 
Lohe  et  al.  1995). 

Horizontal  Transmission  of  Transposable  Elements 

Among  the  most  widespread  families  of  transposable  elements  is  that  of  the 
mariner- like  elements  (MLEs),  typified  by  the  transposable  element  mariner. 
The  molecular  organization  of  the  mariner  element  is  illustrated  in  Figure 
5.20A.  The  element  is  flanked  by  short  (28  base  pair)  inverted  repeats  (IR) 
and  includes  a long  open  reading  frame  coding  for  the  transposase  protein 
(Hartl 1989). Insertion of the element is invariably adjacent to a 5'-TA-3' dinucleotide
in the host genome and is accompanied by a duplication of the dinucleotide,
so that the inserted mariner is flanked by 5'-TA-3'. The target
sequence  and  dinucleotide,  as  well  as  features  of  the  amino  acid  sequence  of 
the  transposase  protein,  identify  a transposable  element  as  an  MLE. 

MLEs  are  widely  distributed  among  insects  and  other  invertebrates 
(Robertson  1993;  Robertson  and  MacLeod  1993).  Figure  5.20B  shows  the  dis- 
tribution among  species  in  the  major  insect  orders  (Coleoptera,  Diptera,  and 
so  forth).  The  number  of  copies  of  an  MLE  per  genome  varies  widely  among 
species,  ranging  from  a few  copies  to  many  thousands.  The  MLEs  in  Figure 
5.20B  have  been  grouped  according  to  similarity  in  nucleotide  sequence  and 
arranged  in  the  form  of  a tree  with  the  root  to  the  left  and  the  tips  of  the 
branches  to  the  right.  There  are  several  subfamilies  of  insect  MLEs,  denoted 
mauritiana,  cecropia,  honeybee,  and  so  forth.  MLEs  in  different  subfamilies 
are  typically  40  to  50%  identical  in  nucleotide  sequence,  and  those  within  the 
same  subfamily  are  usually  60%  or  more  identical.  All  of  the  insect  MLEs  are 
more  closely  related  to  each  other  than  they  are  to  an  MLE  found  in  the  soil 
nematode  Caenorhabditis  elegans. 




[Figure 5.20: (A) diagram of mariner: IR, transposase-coding region, IR. (B) tree of
MLE subfamilies keyed by insect order: Coleoptera, Diptera, Hemiptera,
Hymenoptera, Lepidoptera, Thysanura, Other.]

Figure 5.20 (A) The molecular organization of the transposable element
mariner showing the inverted repeats flanking the transposase-coding region.
(B) Distribution of MLEs among species representing major insect orders (numbered).
Note that the MLEs can be grouped into subfamilies of elements (mauritiana,
cecropia, and so forth) based on their similarity in sequence. C. elegans is
the soil nematode Caenorhabditis elegans. (Data for B from Robertson 1993.)


Although  MLEs  are  widespread,  their  distribution  is  "spotty,"  which 
means  that,  among  closely  related  species,  a particular  type  of  MLE  may  be 
found  in  some  species  but  not  in  others.  Furthermore: 

• Any  species  may  contain  MLEs  from  two  or  more  different  subfamilies. 

• Closely  related  MLEs  are  often  found  in  distantly  related  species. 

An  example  of  the  second  principle  is  an  MLE  found  in  Drosophila  erecta, 
a close  relative  of  D.  melanogaster,  which  is  97%  identical  in  nucleotide 
sequence  with  an  MLE  found  in  the  cat  flea  Ctenocephalides  felis  (Lohe  et  al. 
1995).  For  comparison,  a gene  coding  for  a subunit  of  the  cellular  sodium 
pump  sequenced  in  both  species  shows  only  39%  nucleotide  identity  at  third 
codon  positions. 

What process can account for the virtual identity between MLEs in
species  as  distantly  related  as  a Drosophila  and  a cat  flea?  One  possibility  is 
that  the  MLE  was  present  in  the  common  ancestor  of  the  species  a few  hun- 
dred million  years  ago  and  then  virtually  stopped  evolving,  so  that  the 
sequences  remain  almost  identical  today.  Unless  the  nucleotide  sequence  is 
very  highly  constrained,  including  third  codon  positions,  this  is  a very 
unlikely  possibility.  Furthermore,  if  MLE  sequences  are  so  constrained,  then 
why  is  there  so  much  sequence  variability  within  and  among  subfamilies? 
More  likely  than  evolution  stopping  dead  in  its  tracks  for  several  hundred 
million  years  is  the  hypothesis  of  horizontal  transmission,  or  the  ability  of 
an  MLE  to  be  transferred  from  a host  species  into  the  germline  of  a different, 
reproductively  isolated  species.  To  account  for  the  D.  erecta-C.  felis  case  by 
horizontal  transmission,  an  MLE  would  have  to  have  been  transmitted  from 
a D.  erecta  ancestor  to  a C.  felis  ancestor  (or  the  other  way  around)  approxi- 
mately 3 to  10  million  years  ago.  Many  additional  examples  of  horizontal 
transmission  of  MLEs  and  other  eukaryotic  transposable  elements  have  been 
discovered.  Although  the  process  of  horizontal  transmission  certainly  takes 
place,  the  rate  at  which  it  happens  and  the  vectors  and  mechanisms  are  as  yet 
unknown. 

Once  introduced  into  a genome,  MLEs  can  persist  through  multiple  spe- 
ciation  events  (Maruyama  and  Hartl  1991).  A lineage  can,  however,  lose  an 
MLE,  as  evidenced  by  D.  melanogaster,  which  has  lost  an  MLE  (the  mariner 
element  itself)  present  in  all  its  closest  relatives.  Two  processes  appear  to  con- 
tribute to  loss  of  an  MLE:  (1)  mutational  inactivation,  which  may  destroy  the 
protein-coding  function  of  an  MLE  or  impair  its  ability  to  transpose;  and  (2) 
stochastic  loss,  by  which  we  mean  the  elimination  of  an  MLE  from  the 
genome  as  a result  of  random  genetic  drift.  There  might  possibly  also  be  a 
contribution  from  natural  selection,  depending  on  the  extent  to  which  pres- 
ence of  the  MLE  itself  is  deleterious.  From  the  standpoint  of  the  host  species, 
an  inactivating  mutation  in  an  MLE  may  be  selectively  neutral,  or  perhaps 
even  favorable,  inasmuch  as  natural  selection  may  act  to  minimize  the  harm- 
ful mutagenic  effects  of  transposition.  Subsequent  mutations  in  an  already 
inactivated  MLE  are  presumably  selectively  neutral  and  ultimately  lost  by 
chance.  The  role  of  mutational  inactivation  and  stochastic  loss  in  the  evolu- 
tionary dynamics  of  MLEs  is  supported  by  the  spotty  distribution  of  MLEs 
among  closely  related  species. 

SUMMARY 

Mutation  provides  the  raw  material  for  evolutionary  change  but,  by  itself, 
mutation  pressure  is  a very  weak  force  for  changing  allele  frequency.  If  allele 
A mutates to allele a at a rate $\mu$ per generation, and a undergoes reverse
mutation at a rate $\nu$ per generation, then the equilibrium frequency of A is
$\nu/(\mu + \nu)$, but the population may require tens of thousands or hundreds of
thousands of generations to reach equilibrium. In the infinite-alleles model,
the equilibrium value of $F_{ST}$ for neutral alleles is given by $1/(4N\mu + 1)$, where
$\mu$ is the mutation rate to selectively neutral alleles; $4N\mu + 1$ is called the effective
number of alleles. For a neutral allele, the probability of ultimate fixation
equals  the  frequency  of  the  allele  in  the  population.  Statistical  tests  of  the 
neutrality  hypothesis  based  on  the  effective  number  of  allozyme  alleles  or  on 
the  allozyme  heterozygosity  are  inconclusive  owing  to  lack  of  statistical 
power. 
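
The slow approach to the mutational equilibrium noted at the start of this summary can be illustrated numerically. The sketch below assumes the standard one-generation recursion for forward and reverse mutation, p' = p(1 - μ) + (1 - p)ν, whose fixed point is ν/(μ + ν); the function names and the two mutation rates are hypothetical values chosen for illustration only.

```python
import math


def next_frequency(p, mu, nu):
    """Frequency of A after one generation of forward (mu) and reverse (nu) mutation."""
    return p * (1.0 - mu) + (1.0 - p) * nu


def frequency_after(p0, mu, nu, t):
    """Closed-form frequency of A after t generations of the same recursion."""
    p_eq = nu / (mu + nu)
    return p_eq + (p0 - p_eq) * (1.0 - mu - nu) ** t


mu, nu = 2e-6, 1e-6                  # hypothetical mutation rates per generation
p_eq = nu / (mu + nu)                # equilibrium frequency of A
half_time = math.log(2) / (mu + nu)  # approximate generations to go halfway

print(round(p_eq, 4), round(half_time))
print(round(frequency_after(1.0, mu, nu, round(half_time)), 4))
```

With rates of this magnitude, a population starting at p = 1 needs on the order of a few hundred thousand generations just to cover half the distance to equilibrium, which is why mutation pressure alone is such a weak force.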

Recombination  allows  the  formation  of  beneficial  combinations  of  genes. 
In  Drosophila,  there  is  a positive  correlation  between  the  rate  of  recombina- 
tion and  the  level  of  nucleotide  polymorphism:  regions  of  reduced  recombi- 
nation are  less  polymorphic.  The  reduced  polymorphism  could  result  from 
selective  sweeps  of  favorable  mutations  or  from  background  selection 
against  detrimental  mutations.  In  prokaryotes,  there  is  extensive  linkage  dis- 
equilibrium over  long  genetic  distances  in  spite  of  the  fact  that  each  gene 
may  have  a mosaic  ancestry  owing  to  intragenic  recombination.  The  appar- 
ent paradox  results  because  recombination  in  prokaryotes  usually  involves  a 
short  stretch  of  DNA  and  the  process  is  infrequent.  In  animal  mitochondrial 
DNA,  the  absence  of  recombination  enables  the  identification  of  mitochon- 
drial lineages. 

Migration  hinders  genetic  divergence  among  subpopulations.  In  finite 
populations, the equilibrium value of $F_{ST}$ with migration is given by
$1/(4Nm + 1)$, and only a few migrants per generation are sufficient to keep
$F_{ST}$ smaller than about 10%. On the other hand, a small amount of migration
is  usually  not  sufficient  to  disperse  rare  alleles  among  subpopulations,  and  so 
rare  alleles  are  often  unique  to  one  or  a few  subpopulations. 
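
A short numerical sketch shows how quickly the equilibrium value falls as the number of migrants per generation, Nm, increases. The function name `equilibrium_fst` and the particular values of Nm are illustrative assumptions; only the formula 1/(4Nm + 1) comes from the text.

```python
def equilibrium_fst(n_migrants):
    """Equilibrium F_ST = 1 / (4Nm + 1), with n_migrants = N * m per generation."""
    return 1.0 / (4.0 * n_migrants + 1.0)


# Hypothetical numbers of migrants per generation.
for n_migrants in (0.5, 1, 2.25, 10):
    print(n_migrants, round(equilibrium_fst(n_migrants), 3))
```

Under this formula the equilibrium value reaches 10% at Nm = 2.25, so any larger number of migrants per generation keeps it below that level.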

Transposable  elements  are  ubiquitous  in  the  genomes  of  all  organisms. 
Their  tendency  to  increase  in  copy  number  through  their  ability  to  repli- 
cate and  transpose  is  usually  offset  by  the  harmful  effects  of  the  insertions 
themselves;  hence,  there  is  an  equilibrium  distribution  of  copy  number 
among  organisms.  Some  transposable  elements  have  direct  or  indirect  ben- 
eficial effects;  bacterial  transposons  that  carry  genes  for  antibiotic  resis- 
tance provide  an  example.  Bacterial  transposons  are  disseminated  among 
organisms  and  among  species  by  transmission  of  infectious  plasmids  in 
which  the  transposons  may  reside.  In  eukaryotes,  horizontal  transmission 
can  take  place  between  species  in  spite  of  absolute  reproductive  isolation. 
Many  transposable  elements  can  be  grouped  into  subfamilies,  families, 
and  superfamilies  based  on  their  degree  of  nucleotide  sequence  similarity. 
The  mariner- like  elements  (MLEs)  are  exceptionally  widespread  among 
insects  and  other  invertebrates.  The  innate  tendency  of  an  MLE  to  increase 
in copy number in a genome is offset by mutational inactivation and ultimately
stochastic loss. These offsetting processes may explain the spotty
distribution  of  MLEs  observed  among  closely  related  species. 


PROBLEMS 

1.  Most  protein-coding  genes  have  a forward  mutation  rate  (normal  to 
mutant)  that  is  at  least  an  order  of  magnitude  greater  than  the  reverse 
mutation  rate  (mutant  back  to  normal).  Why  should  this  be  the  case? 

2.  A classical  bacterial  experiment  demonstrated  that  mutations  occur  at 
random  and  not  in  response  to  specific  selection  pressures  for  them.  The 
experiment  used  sterilized  velvet  to  imprint  the  geometrical  pattern  of 
bacterial  colonies  on  an  agar  surface  in  a petri  dish  (a  "plate"),  which  was 
used  to  replicate  the  pattern  by  impressing  the  velvet  on  sterile  nutrient 
agar  in  a selective  plate  containing  an  antibiotic.  Colonies  on  the  original 
plate  giving  resistant  cells  on  the  selective  plate  were  dispersed  into  sin- 
gle cells,  spread  onto  a nutrient  agar  plate  without  antibiotic,  and 
allowed  to  multiply  into  colonies.  This  procedure  was  repeated  until  one 
or  more  colonies  on  the  unselective  media  consisted  exclusively  of  antibi- 
otic resistant  cells.  How  does  this  experiment  prove  the  point? 

3.  Estimation  of  mutation  rates  from  bacterial  cultures  can  be  tricky 
because,  if  a mutation  occurs  early  in  the  life  of  a culture,  the  final  fre- 
quency will  be  very  high;  but  if  it  occurs  late,  the  final  frequency  will  be 
low.  The  fluctuation  test  is  a method  for  getting  around  this  problem  by 
growing  many  smaller  cultures  and  estimating  the  mutation  rate  from 
the  proportion  of  cultures  that  contain  no  mutations  using  the  zero  term 
of the Poisson distribution $P_0 = \exp(-\mu N)$, where $P_0$ is the proportion of
cultures with no mutations, $\mu$ is the mutation rate, and N is the average
number of cells per culture. In one experiment for bacteriophage T1 resistance,
11/20 cultures contained no mutations and the average number of
cells per culture was $5.6 \times 10^8$. Estimate $\mu$.

4.  If  recessive  lethals  occur  independently  in  Drosophila  autosomes,  and  the 
probability  that  an  autosome  contains  one  or  more  recessive  lethals  is 
0.35  (a  typical  figure  for  chromosomes  isolated  from  natural  popula- 
tions), what  is  the  average  number  of  recessive  lethals  per  chromosome? 
Assume  that  the  distribution  of  lethals  is  Poisson  so  that  the  probability 
of a chromosome containing exactly i lethals is $P_i = (m^i/i!)\exp(-m)$, where
m is  the  mean. 

5.  The  doubling  dose  of  radiation  is  the  quantity  of  radiation  that  induces  as 
many  mutations  as  occur  spontaneously,  so  the  total  mutation  rate  of 
organisms  exposed  to  the  doubling  dose  equals  two  times  the  sponta- 
neous mutation  rate.  Below  are  the  induction  rates  per  rad  of  x-rays  (a 
standard measure of dose) for various genetic end points in irradiated
male  mice,  along  with  the  spontaneous  rates.  What  are  the  corresponding 
doubling  doses? 


                             Induction rate/rad              Spontaneous rate

Dominant lethals             $5 \times 10^{-4}$/gamete        2 to $10 \times 10^{-2}$/gamete
Recessive visibles           $7 \times 10^{-8}$/locus         $8 \times 10^{-6}$/locus
Reciprocal translocations    1 to $2 \times 10^{-5}$/cell     2 to $5 \times 10^{-4}$/cell

6. For irreversible mutation with a forward mutation rate $\mu = 5 \times 10^{-6}$, calculate
the allele frequency p after 10, 100, 1000, and 10000 generations,
assuming $p_0 = 1.0$.

7.  If  a transposable  genetic  element  becomes  fixed  at  a particular  site  but 
undergoes  deletion  at  the  rate  of  one  percent  per  generation,  how  many 
generations  are  required  to  decrease  the  frequency  of  the  element  at  the 
site  to  90%? 

8.  The  following  data  give  the  frequency  q of  bacteria  resistant  to  a bacterio- 
phage after  t generations  of  chemostat  growth.  At  t = 12  hours  a novel 
metabolite  was  added  to  the  medium. 

a.  What  is  the  basal  rate  of  mutation  to  resistance? 

b.  What  is  the  effect  of  the  novel  metabolite  on  the  mutation  rate? 


t      q                        t      q

0      $1 \times 10^{-6}$       16     $7.04 \times 10^{-6}$
4      $3 \times 10^{-6}$       20     $7.08 \times 10^{-6}$
8      $5 \times 10^{-6}$       24     $7.12 \times 10^{-6}$
12     $7 \times 10^{-6}$

9.  In  the  forward  and  reverse  mutation  model,  what  is  the  equilibrium  fre- 
quency p of  A if 

a. $\mu = 10^{-5}$ and $\nu = 10^{-6}$?

b. $\mu$ is increased tenfold?

c. $\nu$ is increased tenfold?

d.  both  are  increased  tenfold? 

10.  In  the  forward  and  reverse  mutation  model,  show  that  the  time  required 
for  the  allele  frequency  to  go  halfway  to  equilibrium  is  approximately 
$t = 0.7/(\mu + \nu)$ generations. Use the approximation that $\ln(1 - x) \approx -x$
when x is small. What time is required to go halfway to equilibrium
when $\mu = 10^{-5}$ and $\nu = 10^{-6}$?

11.  In  the  irreversible  mutation  model,  what  is  the  frequency  qt  of  allele  a in 
generation  t if  the  mutation  rate  changes  from  generation  to  generation? 
If the equation $q_t = q_0 + \mu t$ is applied to this situation, what value corresponds
to $\mu$?

12. Suppose a gene has eight alleles at frequencies 0.55, 0.20, 0.09, 0.06, 0.04,
0.03,  0.02,  and  0.01.  What  is  the  effective  number  of  alleles?  What  would 
the  effective  number  be  if  each  allele  had  a frequency  of  0.125? 

13.  Why  is  the  effective  number  of  alleles  essentially  independent  of  the 
number  of  rare  alleles? 

14.  What  is  the  equilibrium  heterozygosity  in  a population  of  effective  size 
50 if new neutral mutations are introduced at a rate $10^{-5}$ by mutation and
at a rate $10^{-3}$ by migration?

15.  If  the  average  number  of  alleles  of  a gene  is  1 + x per  diploid  individual, 
where  0 < x < 1,  then  what  is  the  heterozygosity?  (Note  that  one  diploid 
individual  is  a random  sample  of  two  alleles.) 

16.  Calculate  the  autozygosity  F after  200  generations  in  a random  mating 
population  of  effective  size  N = 50. 

17.  In  an  isolated  random  mating  population  of  effective  size  N,  how  many 
generations  of  random  genetic  drift  are  required  to  produce  the  same 
average  inbreeding  coefficient  F as  obtained  in  one  generation  of  brother- 
sister mating (for which F = 1/4)? Use the approximation $[1 - 1/(2N)]^t \approx
\exp(-t/2N)$.

18.  If  a mainland  population  of  snails  has  an  allele  frequency  of  0.8  and  an 
island  population  has  a frequency  of  0.2,  how  many  generations  are 
required  for  the  island  population  to  achieve  an  allele  frequency  of  0.5, 
given  a migration  rate  of  0.01? 

19.  If  four  populations  with  allele  frequencies  0.2,  0.4,  0.6,  and  0.8  undergo 
migration  according  to  the  island  model  with  m = 0.05,  what  are  the 
expected  allele  frequencies  after  10  generations? 

20. In the island model of migration, how does the variance in allele frequency
among populations change as a function of m and t?

21.  When  random  genetic  drift  is  offset  by  migration  among  populations  in 
the  island  model,  what  value  of  m is  necessary  to  keep  the  equilibrium 
value  of  F smaller  than  0.05? 

Darwinian Selection

Natural  Selection  Fitness  Haploid  Models  Diploid  Models 
Mutation-Selection  Balance  Complex  Modes  of  Selection 
Kin  Selection  Interdeme  Selection 


Thus far in this book, the term natural selection has been used in
the  informal,  intuitive  sense  used  by  Darwin  in  The  Origin  of 
Species  (1859): 


Owing  to  this  struggle  for  life,  variations,  however  slight  and  from  whatever 
cause  proceeding,  if  they  be  in  any  degree  profitable  to  the  individuals  of  a 
species,  in  their  infinitely  complex  relations  to  other  organic  beings  and  to 
their  physical  conditions  of  life,  will  tend  to  the  preservation  of  such  individ- 
uals, and  will  generally  be  inherited  by  the  offspring.  The  offspring,  also,  will 
thus  have  a better  chance  of  surviving,  for,  of  the  many  individuals  of  any 
species  which  are  periodically  born,  but  a small  number  can  survive.  I have 
called  this  principle,  by  which  each  slight  variation,  if  useful,  is  preserved,  by 
the  term  Natural  Selection. 


Modern formulations of natural selection are less literary and usually
compacted into a form resembling a logical syllogism:

• In  all  species,  more  offspring  are  produced  than  can  possibly  survive  and 
reproduce. 

• Organisms  differ  in  their  ability  to  survive  and  reproduce — in  part  owing 
to  differences  in  genotype. 

• In  every  generation,  genotypes  that  promote  survival  in  the  current  envi- 
ronment are  present  in  excess  at  the  reproductive  age  and  thus  contribute 
disproportionately  to  the  offspring  of  the  next  generation. 

Through natural selection, therefore, alleles that enhance survival and
reproduction  increase  gradually  in  frequency  from  generation  to  generation, 
and  the  population  becomes  progressively  better  able  to  survive  and  repro- 
duce in  the  environment.  The  progressive  genetic  improvement  in  popula- 
tions resulting  from  natural  selection  constitutes  the  process  of  evolutionary 
adaptation. 

In  the  brief  description  of  natural  selection  quoted  above,  Darwin  uses  the 
term  individual  three  times.  The  unit  of  selection  is  the  individual  organism— 
not  the  species,  not  the  subpopulation,  not  the  sibship.  It  is  the  performance 
of  the  individual  organism  that  matters.  Each  individual  organism  competes 
in the struggle for existence and survives or perishes on its own. Darwin also
used the terms "struggle for existence" and "survival of the fittest" as
synonyms for natural selection, but he emphasized that he employed the terms
in their widest metaphorical sense to include not only the life of the organism
but also the success of the organism in leaving progeny: fecundity is as
important as survival. In this chapter, we shall see how Darwin's concept of
"survival of the fittest" of individual organisms has been made more formal
and  quantitative  and  incorporated  into  models  describing  the  change  in 
allele  frequency  under  natural  selection.  These  models  show  that  natural 
selection  acts  simultaneously  on  different  components  of  fitness  and  can 
operate  at  different  levels  of  population  structure. 

SELECTION  IN  HAPLOID  ORGANISMS 

Selection acts on the phenotype, not on the genotype, and the total
phenotype is determined by many genes that interact with each other as well as
with numerous environmental factors. However, in exploring the consequences
of selection, it is convenient to focus on changes in the frequency of
the alleles of a single gene. We shall begin by examining selection in its
simplest form operating in a haploid, asexual organism, such as a species of
bacteria. In haploids, selection is realized as differential population growth;
hence we shall make reference to the discrete and continuous models of
population growth examined in Chapter 1. The overall process of selection is
identical whether population growth is in discrete or continuous generations,
but the models have a somewhat different parameterization and it is
necessary to relate the models to avoid confusion later.

Discrete  Generations 

Consider two bacterial genotypes, A and B, that reproduce asexually. For
simplicity, we will assume the discrete model of population growth discussed
in Chapter 1, and we set a and b equal to the rates of population growth of A
and B, respectively. Equation 1.5 implies that $A_t = (1 + a)^t A_0$ and
$B_t = (1 + b)^t B_0$, where $A_t$ and $B_t$ are the number of cells of
genotype A and genotype B, respectively, at time t. Selection takes place when
$a \neq b$. Figure 6.1A is an example in which the growth rates of A and B are
a = 0.04 and b = 0.05, respectively. Both populations increase in size
exponentially, but that of B increases faster than that of A. In most cases, we
are not interested in the actual number of A cells or B cells but in the
proportion of all cells that are of type A. Equivalently, we can examine the
ratio of the number of A cells to that of B cells at time t, which is given by

$$\frac{A_t}{B_t} = \left(\frac{1 + a}{1 + b}\right)^t \frac{A_0}{B_0} = w^t \left(\frac{A_0}{B_0}\right) \tag{6.1}$$

The outcome of selection is determined by the relative magnitudes of a and b
because, if a < b, then the ratio of A cells to B cells decreases until,
ultimately, A is lost; conversely, if a > b, then the ratio of A cells to B
cells increases without limit. Figure 6.1B shows the change in A/B for the
example in part A. From a value of 4 at the beginning, the ratio declines to a
value of 1.54 in 100 generations; these ratios correspond to frequencies of A
of 0.80 and 0.61, respectively.

Figure 6.1 (A) Discrete population growth of two hypothetical bacterial
strains, A and B, in which the growth rates are 4% per generation for A and 5%
per generation for B. For clarity, the population size is plotted every second
generation. The initial cell numbers are $1.6 \times 10^5$ for A and
$0.4 \times 10^5$ for B. (B) Ratio of cell numbers of A : B. Because the B
population grows faster than the A population, the proportion of A in the
total population decreases.
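
The numbers quoted for this example can be checked with a few lines of Python. This is only a sketch: the function name `ratio_and_frequency` is made up for illustration, and the growth rates and starting cell numbers are the ones given for Figure 6.1.

```python
# Discrete growth of two asexual strains: N_t = (1 + rate)**t * N_0.
a, b = 0.04, 0.05          # per-generation growth rates of A and B (Figure 6.1)
A0, B0 = 1.6e5, 0.4e5      # initial cell numbers (Figure 6.1 legend)


def ratio_and_frequency(t):
    """Return (A_t / B_t, frequency of A) after t generations."""
    A_t = (1 + a) ** t * A0
    B_t = (1 + b) ** t * B0
    return A_t / B_t, A_t / (A_t + B_t)


for t in (0, 100):
    ratio, freq = ratio_and_frequency(t)
    print(f"t = {t:3d}: A/B = {ratio:.2f}, frequency of A = {freq:.2f}")
```

Both populations grow throughout, yet the ratio A/B falls, which is the sense in which selection here is differential growth.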

In the selection in Figure 6.1, it is not necessary to specify whether a and b
differ because of survivorship or fecundity. All that matters is that they do
differ. It is also important that the outcome depends only on the ratio
(1 + a)/(1 + b), which means that, in practice, we do not need to know the
absolute growth rates of A and B but only their relative values (their ratio).
In Equation 6.1, w represents the ratio (1 + a)/(1 + b). The symbol w is
conventionally used in discrete models of selection and, in this example, it is
the relative fitness of genotype A to that of genotype B. In other words, in a
haploid organism, the relative fitness equals the ratio of the per-generation
growth factors, (1 + a)/(1 + b).

Although it is sometimes instructive to do so, it is not necessary to keep
track of population size in models of selection. The variable of interest is
usually the allele frequency and not the population size. Therefore, let $p_t$
and $q_t$ represent the frequencies of genotypes A and B, respectively, in
generation t, with $p_t + q_t = 1$. A method to relate the frequencies of A and
B in any two successive generations is illustrated in Table 6.1. For ease of
discussion, we divide each generation into three phases: birth, selection, and
reproduction. In generation t - 1, the frequencies of A and B at birth are
$p_{t-1}$ and $q_{t-1}$, respectively. The genotypes A and B are assumed to
survive in the ratio w : 1, which means that w is the probability of survival
of an A genotype relative to that of a B genotype. As before, the absolute
probabilities of survival of the genotypes are not relevant. All that matters
is the ratio. After selection, the ratio of frequencies of A : B equals
$p_{t-1} \times w : q_{t-1} \times 1$. If the surviving genotypes reproduce
with equal efficiency, then the frequencies at birth in the following
generation are given by the expressions across the bottom in Table 6.1; the
denominators in these expressions are necessary to make the allele frequencies
in generation t sum to 1.


TABLE 6.1 A MODEL OF SELECTION IN A HAPLOID ORGANISM, IN WHICH w IS THE
PROBABILITY OF SURVIVAL OF AN A CELL RELATIVE TO THAT OF A B CELL

                                                 Genotype
                                A                                      B
Generation t - 1
  Frequency before selection    $p_{t-1}$                              $q_{t-1}$
  Relative fitness              $w$                                    $1$
  After selection               $p_{t-1} w$                            $q_{t-1}$
Generation t                    $p_{t-1} w / (p_{t-1} w + q_{t-1})$    $q_{t-1} / (p_{t-1} w + q_{t-1})$

Note: The fractions in the bottom line are expressions for the allele
frequencies in generation t in terms of those in generation t - 1. Although
this model assumes differential survival, w : 1 could also be the relative
probability of reproduction of A and B. More generally, the relative fitness
w : 1 represents the net output of A : B for the combined effects of
differential survival and reproduction.

For comparison with Equation 6.1, consider that $p_t$ is the number of A
cells in generation t divided by the total; likewise, $q_t$ is the number of B
cells divided by the total. Therefore, the ratio $p_t/q_t$ equals the ratio of
A cells to B cells in generation t because the denominators cancel. The
expressions in Table 6.1 imply that the ratio p/q in any generation equals w
multiplied by the ratio p/q in the previous generation, and so

$$\frac{p_t}{q_t} = w \left(\frac{p_{t-1}}{q_{t-1}}\right) = w^t \left(\frac{p_0}{q_0}\right) \tag{6.2}$$

The right-hand side of Equation 6.2 is identical to that in Equation 6.1
except that the relative frequencies p and q replace the absolute number of
cells of type A and type B. Hence, to deduce the outcome of selection, we do
not need to keep track of population size. All we need to know is the relative
fitness w and the initial frequencies $p_0$ and $q_0$.
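
As an illustration, the generation-by-generation recursion from Table 6.1 can be iterated and compared with the statement that p/q is multiplied by w in every generation. The sketch below assumes the Figure 6.1 values (initial frequency 0.80 and w = 1.04/1.05); the function names are hypothetical.

```python
def next_frequency(p, w):
    """One generation of haploid selection: p' = p*w / (p*w + q)."""
    q = 1.0 - p
    return p * w / (p * w + q)


def frequency_after(p0, w, t):
    """Iterate the recursion t times, starting from frequency p0."""
    p = p0
    for _ in range(t):
        p = next_frequency(p, w)
    return p


def closed_form_ratio(p0, w, t):
    """p_t / q_t obtained by multiplying p_0 / q_0 by w, t times over."""
    return w ** t * p0 / (1.0 - p0)


p0, w, t = 0.8, 1.04 / 1.05, 100      # values taken from the Figure 6.1 example
p_t = frequency_after(p0, w, t)
print(p_t / (1.0 - p_t), closed_form_ratio(p0, w, t))
```

The two printed ratios agree to within floating-point error, and population size never enters the calculation.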

For application to experimental data, Equation 6.2 is often transformed by
taking the logarithm:

$$\log\left(\frac{p_t}{q_t}\right) = \log\left(\frac{p_0}{q_0}\right) + t \log w \tag{6.3}$$

Equation 6.3 means, for example, that if the values of $p_t/q_t$ are monitored
in an experimental population of bacteria over the course of time, then a plot
of $\log(p_t/q_t)$ against time (in generations) should yield a straight line
with slope equal to log w. This kind of experiment is examined in the following
problem.


PROBLEM 6.1 In the intestinal bacterium E. coli, the gene gnd codes
for the enzyme 6-phosphogluconate dehydrogenase (6PGD), which is
used in the metabolism of gluconate but not in the metabolism of
ribose. The data below were obtained in experiments in which otherwise
genetically identical strains containing the alleles gnd(RM77C)
and gnd(RM43A) were grown in competition in chemostats in which
the sole source of carbon and energy was either gluconate or ribose
(Hartl and Dykhuizen 1981). These gnd alleles are polymorphic in
natural populations and code for allozymes of 6PGD. Gluconate is the
experimental condition to ascertain the effects on fitness of the gnd
alleles, and ribose is the control. In the table, $p_t$ denotes the frequency
of the strain containing gnd(RM43A) after t generations of competition.
From the two points under each growth condition, estimate the
fitness of the strain containing gnd(RM43A) relative to that containing
gnd(RM77C) under the growth condition:

Growth medium    $p_0$    $p_{35}$

Gluconate        0.455    0.898

Ribose           0.594    0.587


ANSWER  In  gluconate  medium,  log  (0.898/0.102)  = log  (0.455/0.545) 
+ 35  x log  w,  and  so  log  w = 0.0292,  or  w = 1.0696.  Hence,  the  allele 
gnd(RM43A)  confers  about  a 7%  selective  advantage  in  competition  for 
utilization  of  gluconate.  In  ribose  medium,  w = 0.999,  a value  that  is 
not  significantly  different  from  1.0,  and  so  the  alleles  appear  to  be 
functionally  equivalent  in  this  environment.  (There  were  more  than 
two  points  in  the  original  data,  and  the  estimates  of  fitness  were  based 
on  the  slope  of  the  linear  regression;  here  we  have  quoted  only  two 
data  points  for  computational  convenience.) 
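
The two-point estimate in this answer can be reproduced as follows. This is a minimal sketch, not the regression used in the original study; `relative_fitness` is an illustrative name, and base-10 logarithms are assumed because they give the quoted value log w = 0.0292.

```python
import math


def relative_fitness(p0, pt, generations):
    """Two-point estimate of w from log(pt/qt) = log(p0/q0) + t * log(w)."""
    log_ratio_start = math.log10(p0 / (1.0 - p0))
    log_ratio_end = math.log10(pt / (1.0 - pt))
    log_w = (log_ratio_end - log_ratio_start) / generations
    return 10 ** log_w


data = {"gluconate": (0.455, 0.898), "ribose": (0.594, 0.587)}
estimates = {medium: relative_fitness(p0, p35, 35)
             for medium, (p0, p35) in data.items()}
for medium, w in estimates.items():
    print(f"{medium}: w = {w:.4f}")
```

Any logarithm base gives the same w, provided the same base is used to take the logarithm and to undo it.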


Continuous  Time 

Bacterial populations such as those in Problem 6.1 do not reproduce in
discrete generations; instead, they reproduce continuously. In a continuous
model, the exponential population growth of A and B is governed by the
equations $dA(t)/dt = a'A(t)$ and $dB(t)/dt = b'B(t)$, where a' and b' are the
growth rates. Therefore, $A(t) = A(0)e^{a't}$ and $B(t) = B(0)e^{b't}$
(Chapter 1), and so

$$\frac{A(t)}{B(t)} = \frac{A(0)}{B(0)} e^{(a' - b')t} = \frac{A(0)}{B(0)} e^{mt} \tag{6.4}$$


Equation 6.4 means that, in a continuous population, the outcome of
selection depends on the difference between the exponential growth rates
a' - b', which is represented by the symbol m on the right-hand side. The value
of m also measures the fitness of strain A relative to strain B, but in a
continuously reproducing population. Comparing Equation 6.4 with Equation 6.1
yields the relation between m and w:

$$m = \ln w \tag{6.5}$$

In other words, the relative fitness with continuous growth m equals the
natural logarithm of the relative fitness with discrete reproduction w.
Selective neutrality means that w = 1 or that m = 0. For the values of w
estimated in Problem 6.1, the corresponding values of m are 0.0673 and -0.001,
respectively. If w is not too different from 1, then $m \approx w - 1$ is a
reasonable approximation.
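
A short numerical check of this conversion, using the two fitness estimates from Problem 6.1 (the helper name `malthusian` is an assumption made for this sketch):

```python
import math


def malthusian(w):
    """Continuous-time fitness m as the natural logarithm of discrete fitness w."""
    return math.log(w)


for w in (1.0696, 0.999):      # Darwinian fitnesses estimated in Problem 6.1
    m = malthusian(w)
    print(f"w = {w}: m = {m:.4f}, w - 1 = {w - 1:.4f}")
```

For the ribose value the approximation m ≈ w - 1 is essentially exact, while for the gluconate value it overstates m slightly (0.0696 against 0.0673), which gives a feel for how close to 1 the fitness must be.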

Change  in  Allele  Frequency  in  Haploids 

Although the discrete and continuous models are completely equivalent
under the transformation in Equation 6.5, the equations for change in allele
frequency look rather different. In the discrete model, the change in the
frequency of strain A in generation t is given by the difference
$p_t - p_{t-1}$, which can be calculated in terms of $p_{t-1}$ from the
formulas in Table 6.1. The difference $p_t - p_{t-1}$ is usually symbolized
$\Delta p$ and, for simplicity, the subscript t - 1 is suppressed. Using the
expressions in Table 6.1 and the fact that q = 1 - p, we obtain

$$\Delta p = \frac{pw}{pw + q} - p = \frac{pq(w - 1)}{pw + q} \tag{6.6}$$


Not surprisingly, p increases if the relative fitness of A is greater than 1
and decreases if the relative fitness of A is smaller than 1. If the relative
fitnesses of A and B are equal, then p does not change, provided that the
population size is very large (theoretically, it has to be infinite).

The analog of Equation 6.6 in a continuous model contains the derivative
dp/dt in place of $\Delta p$. This we can obtain from Equation 6.4 with a
little trickery. Because A(t)/B(t) equals p(t)/q(t), the derivative of
Equation 6.4 with respect to t must equal the derivative of p(t)/q(t) with
respect to t. For simplicity, we will write p and q instead of p(t) and q(t).
The derivative of Equation 6.4 with respect to t equals mp/q, and the
derivative of p/q with respect to t equals $(1/q^2) \times dp/dt$ because
dq/dt = -dp/dt. Setting these expressions equal to each other and solving for
dp/dt, we obtain

$$\frac{dp}{dt} = pqm \tag{6.7}$$

Voila! There is no denominator! What happened to it? In a technical sense,
it disappeared into the difference between the discrete model and the
continuous model. In a practical sense, the absence of a denominator in
Equation 6.7 greatly simplifies some of the formulas to come, especially those
concerned with random genetic drift in Chapter 7. Although they look very
different, Equations 6.6 and 6.7 are merely different ways of saying the same
thing. In this chapter, we will deal mainly with expressions analogous to
Equation 6.6 because they are more easily derived for various types of
selection. However, when it is necessary to dispose of a troublesome
denominator, we will invoke the continuous model in Equation 6.7 and be rid
of it.
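
One way to see that the two descriptions agree is to compute both for the same population. The sketch below assumes w = 1.01 and p = 0.3 purely for illustration; the function names are hypothetical. It compares the one-generation change in the discrete model with the rate pqm, and with the exact change over one time unit in the continuous model, whose solution follows from p/q growing as exp(mt).

```python
import math


def delta_p_discrete(p, w):
    """Change in p over one discrete generation: pq(w - 1) / (pw + q)."""
    q = 1.0 - p
    return p * q * (w - 1.0) / (p * w + q)


def dp_dt_continuous(p, m):
    """Instantaneous rate of change in the continuous model: pqm."""
    return p * (1.0 - p) * m


def p_continuous(p0, m, t):
    """Solution of dp/dt = pqm, using p/q = (p0/q0) * exp(m * t)."""
    ratio = p0 / (1.0 - p0) * math.exp(m * t)
    return ratio / (1.0 + ratio)


w = 1.01                   # assumed Darwinian fitness, close to 1
m = math.log(w)            # corresponding Malthusian fitness
p = 0.3                    # assumed starting frequency
print(delta_p_discrete(p, w), dp_dt_continuous(p, m))
print(p_continuous(p, m, 1.0) - p)   # matches the discrete change
```

The rate pqm is only approximately equal to the discrete change, but integrating it over one generation recovers the discrete result exactly, which is the sense in which the denominator has been absorbed.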

Darwinian  Fitness  and  Malthusian  Fitness 

The  distinction  between  the  fitness  parameters  in  the  discrete  and  continuous 
models  has  been  incorporated  into  the  terminology  of  population  genetics  in 
the  terms  Darwinian  fitness,  which  refers  to  the  discrete  model,  and 
Malthusian  fitness,  which  refers  to  the  continuous  model.  The  latter  is 
named  after  Thomas  Malthus  (1766-1834),  whose  views  on  the  implications 
of  continued  population  growth  strongly  influenced  Darwin's  thinking  on 
the  subject.  A Darwinian  fitness  is  conventionally  represented  by  the  symbol 
w, often embellished with a subscript, and Malthusian fitness is
conventionally represented by the symbol m. In this book, the term fitness, when used
without  qualification,  will  mean  Darwinian  fitness  unless  it  is  clear  from  the 
context  that  some  other  meaning  is  intended. 


SELECTION  IN  DIPLOID  ORGANISMS 

In diploid organisms, the consequences of selection are most conveniently
explored under the model of random mating in Chapter 3, but incorporating
selection by permitting the fitnesses of the genotypes to differ. Selection is
assumed to take place on the diploid genotypes. We shall use the conventional
symbols $w_{11}$, $w_{12}$, and $w_{22}$ to represent the Darwinian fitnesses
of the genotypes AA, Aa, and aa, respectively. The simplest way to interpret
the fitnesses is in terms of survivorship, usually termed viability, which is
the probability that a genotype survives from fertilization to reproductive
age. If the fitness of each genotype is set equal to its probability of
survivorship, then each fitness is an absolute fitness because its value is
independent of the fitnesses of the other genotypes. In practice, we usually
know only the value of the viability of each genotype relative to that of
another genotype chosen as the standard of comparison. When a fitness value is
expressed relative to that of another genotype, the fitness is a relative
fitness. The relative fitness of the genotype chosen as the standard of
comparison is arbitrarily assigned the value 1.

To consider a specific example, suppose that the genotypes AA, Aa, and aa
have probabilities of survival from conception to reproductive age of 0.75,
0.75, and 0.50, respectively. These are the absolute viabilities of the
genotypes. They can be judged realistic or not only if we specify the organism.
They may be plausible values if the organism is a mammal or a bird because
each offspring has a reasonable chance of survival, but implausible if the
organism is an insect or an oyster because, in these organisms, most
newborns are destined not to survive. Because selection depends on the
relative magnitudes of the viabilities, it is usually most convenient to
express the viabilities in relative terms. Taking genotype AA as the standard,
the relative viabilities of AA, Aa, and aa are 0.75/0.75, 0.75/0.75, and
0.50/0.75, or 1.0, 1.0, and 0.67, respectively. Equivalently, we could choose
genotype aa as the standard, in which case the relative viabilities are
0.75/0.50, 0.75/0.50, and 0.50/0.50, or 1.5, 1.5, and 1.0, respectively.
Usually, the relative viabilities are calculated so that the largest relative
viability equals 1.0. The relative viabilities are equal to the relative
fitnesses of the genotypes provided that the genotypes are equally capable of
reproduction. Viabilities expressed in relative terms are as valid for osprey
as for oysters because the relative fitnesses are the same whether the
absolute fitnesses are 0.75, 0.75, and 0.50 or 0.00075, 0.00075, and 0.00050.
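
The rescaling described above amounts to dividing by the viability of the chosen standard. A minimal sketch, with the function name `relative_viabilities` and the variable labels chosen only for illustration:

```python
def relative_viabilities(absolute, standard=None):
    """Divide absolute viabilities by a standard (default: the largest value)."""
    if standard is None:
        standard = max(absolute)
    return [v / standard for v in absolute]


osprey = [0.75, 0.75, 0.50]                # AA, Aa, aa (values from the text)
oyster = [0.00075, 0.00075, 0.00050]       # same ratios, far lower survival

print(relative_viabilities(osprey))                  # AA as the standard
print(relative_viabilities(oyster))                  # same relative values
print(relative_viabilities(osprey, standard=0.50))   # aa as the standard
```

The first two lines of output agree apart from floating-point rounding, which is the point of the osprey and oyster comparison: only the ratios of the viabilities enter the selection equations.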

Change  in  Allele  Frequency  in  Diploids 

If we write the allele frequencies of A and a as $p_t$ and $q_t$,
respectively, in generation t, then it is straightforward to derive
expressions for the allele frequencies in generation t in terms of the allele
frequencies $p_{t-1}$ and $q_{t-1}$ in the previous generation. The subscripts
t and t - 1 are rather cumbersome to carry along in equations, so we will use
the symbols p and q for $p_{t-1}$ and $q_{t-1}$ and the symbols p' and q' for
$p_t$ and $q_t$.


TABLE 6.2 DIPLOID SELECTION FOR SURVIVORSHIP (VIABILITY)

                                                    Genotype
                                AA                      Aa                       aa                      Total
Generation t - 1
  Frequency before selection    $p^2$                   $2pq$                    $q^2$                   $1 = p^2 + 2pq + q^2$
  Relative fitness (viability)  $w_{11}$                $w_{12}$                 $w_{22}$
  After selection               $p^2 w_{11}$            $2pq\,w_{12}$            $q^2 w_{22}$            $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$
  Normalized                    $p^2 w_{11}/\bar{w}$    $2pq\,w_{12}/\bar{w}$    $q^2 w_{22}/\bar{w}$
Generation t
  Allele frequency              A: $(p^2 w_{11} + pq\,w_{12})/\bar{w}$           a: $(pq\,w_{12} + q^2 w_{22})/\bar{w}$

Note: The allele frequencies p and q are those in gametes immediately prior to
fertilization. The AA, Aa, and aa zygotes survive to reproductive maturity in
the ratio $w_{11} : w_{12} : w_{22}$. All genotypes, as adults, are assumed to
have the same reproductive capacity.


The relation between the allele frequencies in two consecutive generations
is deduced in Table 6.2, where the fitnesses $w_{11}$, $w_{12}$, and $w_{22}$
are the relative viabilities. In generation t - 1, the genotype frequencies of
AA, Aa, and aa among newly fertilized eggs are given by $p^2$, $2pq$, and
$q^2$, respectively, assuming random mating. By definition, newly fertilized
eggs survive in the ratio $w_{11} : w_{12} : w_{22}$, and so the ratio of
AA : Aa : aa among surviving adults is

$$p^2 w_{11} : 2pq\,w_{12} : q^2 w_{22}$$

To proceed, we need to convert the terms in the above expression into
relative frequencies by dividing each term by the sum. The value of the sum is
indicated in Table 6.2 as

$$\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22} \tag{6.8}$$

The symbol $\bar{w}$ is the average fitness in the population in generation
t - 1. Division of each term in the ratio of survivors by $\bar{w}$ yields the
genotype frequencies among adults:

$$AA: \frac{p^2 w_{11}}{\bar{w}} \qquad Aa: \frac{2pq\,w_{12}}{\bar{w}} \qquad aa: \frac{q^2 w_{22}}{\bar{w}} \tag{6.9}$$


Among the surviving adults, the AA genotypes produce all A gametes,
the Aa genotypes produce 1/2 A and 1/2 a gametes, and the aa genotypes
produce all a gametes. Hence, the frequencies of the gametes that unite at
random to form the zygotes of the next generation are:

$$A: \; p' = \frac{p^2 w_{11} + pq\,w_{12}}{\bar{w}} \qquad a: \; q' = \frac{pq\,w_{12} + q^2 w_{22}}{\bar{w}} \tag{6.10}$$

These are the relations we were after because they express the allele
frequencies in any generation in terms of the allele frequencies in the
previous generation. From these equations, the outcome of selection can be
deduced.
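
The steps that lead to these relations (weight each genotype by its viability, normalize by the mean fitness, then count gametes) can be sketched as a single function. The name `diploid_selection` and the starting frequency p = 0.5 are assumptions for illustration; the relative viabilities 1, 1, and 0.67 come from the example earlier in this section.

```python
def diploid_selection(p, w11, w12, w22):
    """Return (p', mean fitness) after one generation of viability selection."""
    q = 1.0 - p
    # Mean fitness: genotype frequencies p^2, 2pq, q^2 weighted by viability.
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    # A gametes come from all AA survivors and half of the Aa survivors.
    p_next = (p * p * w11 + p * q * w12) / w_bar
    return p_next, w_bar


p_next, w_bar = diploid_selection(0.5, 1.0, 1.0, 0.67)
print(f"mean fitness = {w_bar:.4f}, p' = {p_next:.4f}")
```

Because a is recessive and deleterious in this example, p rises; with all three fitnesses equal the function returns p unchanged.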

As in the haploid model, it is often useful to know $\Delta p$, which is the
difference in allele frequency p' - p resulting from one generation of
selection. Subtraction of p from the expression for p' in Equation 6.10 and a
little manipulation leads to:

$$\Delta p = \frac{pq[p(w_{11} - w_{12}) + q(w_{12} - w_{22})]}{\bar{w}} \tag{6.11}$$

Equation 6.11 is the diploid analog of that in the haploid model in
Equation 6.6.

At this point, an example of the use of these equations is in order. We will
use data on the change in the frequency of the Cy (Curly wings) allele in a
laboratory population of Drosophila melanogaster, which are plotted in Figure
6.2. The Cy allele is lethal when homozygous, so $w_{11} = 0$. The points in
Figure 6.2 pertain to the frequency of Cy heterozygotes but, because Cy/Cy
genotypes do not survive, the allele frequency p of Cy equals one-half the
frequency of Cy/+ adults. The points in the figure are each separated by one
generation, and the initial generation has a frequency of Cy/+ adults of 0.67,
hence $p_0 = 0.335$ and thus $q_0 = 0.665$. Wright (1977) has studied these
data and concluded that $w_{12} = 0.5$ for Cy/+ genotypes, relative to a value
of $w_{22} = 1.0$ for +/+ genotypes. Substituting these values for p, q,
$w_{11}$, $w_{12}$, and $w_{22}$ into the expression for p' in Equation 6.10
yields

$$p' = \frac{0.335^2 \times 0 + 0.335 \times 0.665 \times 0.5}{0.335^2 \times 0 + 2 \times 0.335 \times 0.665 \times 0.5 + 0.665^2 \times 1} = 0.168$$

Therefore, the predicted frequency of Cy/+ adults in generation 1 is
2p' = 0.336, which is reasonably close to the observed value of 0.368.

Figure 6.2 Change in frequency of adult Drosophila melanogaster heterozygous
for the dominant mutation Cy (Curly wings) in an experimental population. The
genotype Cy/Cy is lethal. The curve represents the theoretical change in
frequency when the ratio of viabilities of Cy/+ to +/+ is 0.5 : 1. (Data from
Teissier 1942. The fitness value of 0.5 was estimated by Wright 1977.)


PROBLEM 6.2 Assume a value of p = 0.168 for the frequency of the Cy
allele in generation 1 in the population in Figure 6.2. Calculate the
expected frequency of Cy/+ heterozygotes among adults in generation 2.


ANSWER In this case, $p' = [0.168^2 \times 0 + 0.168 \times 0.832 \times 0.5]/\bar{w}$,
where $\bar{w} = 0.168^2 \times 0 + 2 \times 0.168 \times 0.832 \times 0.5 + 0.832^2 \times 1.0 = 0.832$;
hence p' = 0.0699/0.832 = 0.084. The expected frequency of Cy/+ adults is
2p' = 0.168. This result is very close to the observed value of 0.165. The
theoretical curve in Figure 6.2 was calculated using the same
generation-by-generation algorithm.
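
The generation-by-generation algorithm is easy to express in code. The following sketch uses the fitness values quoted for the Cy example (0 for Cy/Cy, 0.5 for Cy/+, 1 for +/+) and the starting frequency 0.335; the function name `next_p` and the choice of five generations are assumptions for illustration, and no claim is made that this reproduces the original computation behind the figure.

```python
def next_p(p, w11, w12, w22):
    """Allele frequency after one generation of viability selection."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) / w_bar


w11, w12, w22 = 0.0, 0.5, 1.0     # Cy/Cy lethal; Cy/+ viability 0.5; +/+ standard
p = 0.335                          # generation 0 (0.67 Cy/+ adults)
trajectory = [p]
for generation in range(1, 6):
    p = next_p(p, w11, w12, w22)
    trajectory.append(p)
    print(f"generation {generation}: p = {p:.4f}, Cy/+ adults = {2 * p:.4f}")
```

With these particular fitnesses the mean fitness simplifies to q, so the recursion reduces to p' = p/2 and the allele frequency halves every generation. The small differences from the figures in the text (0.1675 against 0.168) come from rounding at each step in the hand calculation.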


We make a slight digression to point out that it is sometimes convenient
to think in terms of the marginal fitnesses of the A and a alleles. The
marginal fitness equals the average fitness of all genotypes containing A or a,
respectively, weighted by their relative frequency and the number of A or a
alleles they contain. For example, A alleles are found in AA and Aa genotypes
in the proportions p and q and, therefore, the marginal fitness $w_1$ of
A-containing genotypes equals $pw_{11} + qw_{12}$. Similarly, the marginal
fitness of a-containing genotypes is $w_2 = pw_{12} + qw_{22}$. The expression
for p' in Equation 6.10 thus becomes $p' = pw_1/\bar{w}$, and Equation 6.11
becomes $\Delta p = p(w_1 - \bar{w})/\bar{w}$. This expression makes it clear
that any allele increases in frequency if the marginal fitness of genotypes
containing the allele ($w_1$) is greater than the average fitness in the
population ($\bar{w}$). This approach also generalizes readily to multiple
alleles: for an allele with frequency $p_i$ and marginal fitness $w_i$, the
change in frequency in one generation equals

$$\Delta p_i = \frac{p_i(w_i - \bar{w})}{\bar{w}} \tag{6.12}$$
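
The marginal-fitness formulation can be sketched for any number of alleles by storing the genotype fitnesses in a symmetric matrix. In the code below the function names, the matrix layout `W[i][j]`, and the numerical fitness values are all assumptions chosen for illustration; the two-allele case is checked against the expression pq[p(w11 - w12) + q(w12 - w22)] divided by the mean fitness.

```python
def marginal_fitnesses(freqs, W):
    """w_i = sum over j of p_j * W[i][j], with W the genotype fitness matrix."""
    return [sum(pj * wij for pj, wij in zip(freqs, row)) for row in W]


def delta_p(freqs, W):
    """Change in each allele frequency: p_i * (w_i - w_bar) / w_bar."""
    w = marginal_fitnesses(freqs, W)
    w_bar = sum(pi * wi for pi, wi in zip(freqs, w))   # mean fitness
    return [pi * (wi - w_bar) / w_bar for pi, wi in zip(freqs, w)]


# Two-allele check with illustrative (assumed) values.
p, q = 0.3, 0.7
w11, w12, w22 = 1.0, 0.9, 0.8
changes = delta_p([p, q], [[w11, w12], [w12, w22]])
w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
two_allele = p * q * (p * (w11 - w12) + q * (w12 - w22)) / w_bar
print(changes[0], two_allele)
```

The changes across all alleles sum to zero, as they must if the frequencies are to keep summing to 1.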


Time  Required  for  a Given  Change  in  Allele  Frequency 

Having derived Equation 6.11 for $\Delta p$ resulting from one generation of
selection, it is an appropriate next step to express $p_t$ in terms of $p_0$,
as we did in Chapter 5 for the analogous equations involving mutation and
migration. For any specified values of the initial allele frequencies and the
fitness parameters, the allele frequencies can be determined generation after
generation by computer iteration, as in Problem 6.2. More generally, one might
want an explicit mathematical formula for $p_t$ in terms of $p_0$, but
Equation 6.11 does not lend itself to analytical solution.

There is an alternative approach based on a continuous model, however.
If the fitnesses are expressed as Malthusian fitnesses rather than as
Darwinian fitnesses, then the analog of Equation 6.11 for a continuously
growing population is

$$\frac{dp}{dt} = pq[p(m_{11} - m_{12}) + q(m_{12} - m_{22})] \tag{6.13}$$

where the values of m are the Malthusian fitnesses. Note that there is no
denominator in Equation 6.13 because it disappeared in the same way as the
denominator in Equation 6.7. A less elegant way to derive an equation like
Equation 6.13 is to suppose that the Darwinian fitnesses are all quite close to
1; then the change in allele frequency is slow enough that
$\Delta p \approx dp/dt$ and, furthermore, $\bar{w} \approx 1$. Under these
conditions, Equation 6.11 takes the form of Equation 6.13 with the m values
replaced with w values.

To solve Equation 6.13, the terms are rearranged to isolate those in p on
one side and those in t on the other, then one side is integrated over p from
$p_0$ to $p_t$ and the other integrated over t from 0 to t. The details are
left as an exercise. The answers are most easily presented if we change the
symbols. For this purpose, we rewrite the fitnesses of the genotypes as
follows:

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$

$$m_{11} = 0 \qquad m_{12} = -hs \qquad m_{22} = -s$$

where the Malthusian fitnesses follow from the approximation
$m_{ij} \approx w_{ij} - 1$ when $w_{ij} \approx 1$. Use of the h and s symbols
for the fitnesses has the advantage of making the amount of selection and the
degree of dominance explicit.

If s is positive and h is not negative, selection favors genotypes carrying
the A allele. In this context, s is called the selection coefficient against
the aa genotype, and h is called the degree of dominance of the a allele. For
example, when h = 0, the Darwinian fitnesses of AA, Aa, and aa are 1, 1, and
1 - s, respectively, and a is completely recessive to A. Alternatively, when
h = 1, the Darwinian fitnesses are 1, 1 - s, and 1 - s, respectively, and a is
completely dominant to A. In terms of the selection coefficient and the degree
of dominance, dp/dt of Equation 6.13 becomes

$$\frac{dp}{dt} = pqs[ph + q(1 - h)] \tag{6.14}$$

The following equations give $p_t$ in terms of $p_0$ in three cases of
importance.

• A is a favored dominant. In this case, h = 0. Then $dp/dt = pq^2 s$, and

$$\ln\left(\frac{p_t}{q_t}\right) + \frac{1}{q_t} = \ln\left(\frac{p_0}{q_0}\right) + \frac{1}{q_0} + st \tag{6.15}$$

• A is favored and the alleles are additive in their effects on fitness.
Additive effects on fitness means that the fitness of the heterozygote is
exactly intermediate between the fitnesses of the homozygotes, and so h = 1/2.
The additive case is also referred to as semidominance or as genic selection.
When h = 1/2, then $dp/dt = pqs/2$, and

$$\ln\left(\frac{p_t}{q_t}\right) = \ln\left(\frac{p_0}{q_0}\right) + \left(\frac{s}{2}\right)t \tag{6.16}$$

Note that Equation 6.16 for additive alleles is similar in form to Equation
6.3 for haploid selection when w = 1 + s/2 and s is small. In other words,
slow selection of additive alleles in a diploid species is mathematically
almost equivalent to selection in a haploid species. In Problem 6.3, you
will see that the precise requirement is $w_{12} = \sqrt{w_{11} w_{22}}$.

• A is a favored recessive. In this case, h = 1, so $dp/dt = p^2 qs$, and

$$\ln\left(\frac{p_t}{q_t}\right) - \frac{1}{p_t} = \ln\left(\frac{p_0}{q_0}\right) - \frac{1}{p_0} + st \tag{6.17}$$


Some  of  the  practical  implications  of  these  equations  are  explored  below. 
Problem  6.3  explores  a little  more  deeply  the  relation  between  selection  in 
haploid  species  and  selection  in  diploid  species.  Figure  6.3  illustrates  the 
changes  in  allele  frequency  for  Equations  6.15  through  6.17. 
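
To get a feel for the time scales involved, the three cases can be solved for t. The sketch below is an illustration under stated assumptions rather than the book's own calculation: it integrates dp/dt = pq²s, dp/dt = pqs/2, and dp/dt = p²qs by separating variables, which gives ln(p/q) + 1/q, ln(p/q), and ln(p/q) - 1/p, respectively, as the quantities that increase linearly with t. The function name, the `mode` labels, and the frequencies 0.01 and 0.99 are hypothetical; s = 0.05 matches the five percent fitness difference used for Figure 6.3.

```python
import math


def generations_required(p0, pt, s, mode):
    """Time for the favored allele A to go from frequency p0 to pt.

    mode describes A: "dominant" (h = 0), "additive" (h = 1/2),
    or "recessive" (h = 1). Continuous-time approximation, small s.
    """
    q0, qt = 1.0 - p0, 1.0 - pt
    log_term = math.log(pt / qt) - math.log(p0 / q0)
    if mode == "dominant":       # from dp/dt = p * q**2 * s
        return (log_term + 1.0 / qt - 1.0 / q0) / s
    if mode == "additive":       # from dp/dt = p * q * s / 2
        return 2.0 * log_term / s
    if mode == "recessive":      # from dp/dt = p**2 * q * s
        return (log_term - 1.0 / pt + 1.0 / p0) / s
    raise ValueError(f"unknown mode: {mode}")


for mode in ("dominant", "additive", "recessive"):
    t = generations_required(0.01, 0.99, 0.05, mode)
    print(f"{mode:9s}: about {t:.0f} generations")
```

For these symmetric endpoints the dominant and recessive cases take the same total time, and both are several times slower than the additive case: the dominant allele stalls when common, the recessive allele when rare.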


PROBLEM 6.3 The discrete model of selection in a haploid species
is completely equivalent to that in a diploid species if, in the diploid,
the Darwinian fitness of the heterozygote equals the geometric mean
of the Darwinian fitnesses of the homozygotes, that is, if
$w_{12} = \sqrt{w_{11}w_{22}}$. Show that, in this case, Equation 6.3 for
$\Delta p$ in a haploid species is, indeed, identical to Equation 6.11 for
$\Delta p$ in a diploid species. What is the equivalent value of w in the
haploid in terms of the Darwinian fitnesses in the diploid?


ANSWER Substitute $w_{12} = \sqrt{w_{11}w_{22}}$ into Equation 6.11. The
numerator simplifies to
$pq(\sqrt{w_{11}} - \sqrt{w_{22}})(p\sqrt{w_{11}} + q\sqrt{w_{22}})$. The
denominator simplifies to $(p\sqrt{w_{11}} + q\sqrt{w_{22}})^2$. Therefore,

$$\Delta p = \frac{pq(\sqrt{w_{11}} - \sqrt{w_{22}})}{p\sqrt{w_{11}} + q\sqrt{w_{22}}}$$

This is in exactly the same form as Equation 6.6 with $w = \sqrt{w_{11}/w_{22}}$.
Taking $w_{22} = 1$ as the standard, w in the haploid model equals the fitness
of the heterozygote in the diploid model. More specifically, let
$w_{11} = (1 + s/2)^2$, $w_{12} = 1 + s/2$, and $w_{22} = 1$. If s is small
compared to 1, then $w_{11} = (1 + s/2)^2 \approx 1 + s$, which implies that the
Darwinian fitnesses are approximately additive. Furthermore,
$\Delta p \approx pqs/2$, which has the same form as dp/dt in the additive case
leading to Equation 6.16.


Figure 6.3 The change in frequency p of a favorable allele that is either
dominant, additive, or recessive in its effect on fitness. The frequency of a
favored dominant allele changes most slowly when the allele is common, and the
frequency of a favored recessive allele changes most slowly when the allele is
rare. In all three examples, the difference in relative fitness between the
homozygous AA and aa genotypes is assumed to be five percent.


PROBLEM 6.4 A certain highly isolated colony of the moth Panaxia
dominula near Oxford, England, was intensively studied by Ford
and collaborators over the period 1928 to 1968 (Ford and Sheppard
1969). This colony contained a mutant allele affecting color pattern.
The frequency of the mutant allele declined steadily over the period
1939 to 1968. Indeed, the accompanying steady increase in the
frequency of the normal allele followed Equation 6.16 for additive
genes with s = 0.20 (Wright, 1978, shows a graph). The species has
one generation per year, and the estimated frequency of the mutant
allele in 1965 was 0.008. (This value is actually the average for the
seven-year period 1962 to 1968.) Estimate the frequency of the
mutant allele in 1950 and in 1940.


ANSWER Here we are given $q_t$ and want to use Equation 6.16 to
estimate $q_0$. Between 1950 and 1965, there were t = 1965 - 1950 = 15
generations. We are given $q_t = 0.008$, hence $p_t = 0.992$ and
$\ln(0.992/0.008) = 4.820$. Thus, $4.820 = \ln(p_0/q_0) + (0.20/2) \times 15$,
or $\ln(p_0/q_0) = 3.32$. Then $p_0/q_0 = 27.660$, or $p_0 = 0.965$ and
$q_0 = 0.035$. For the year 1940, t = 1965 - 1940 = 25 generations, from which
$p_0 = 0.911$ and $q_0 = 0.089$. (You may be interested to know that
observations made at the time yielded estimates of $q_0 = 0.037$ in 1950 and
$q_0 = 0.111$ in 1940.)
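
The arithmetic in this answer can be checked with a few lines of Python. The
sketch below simply rearranges the additive-case relation
ln(p_t/q_t) = ln(p_0/q_0) + (s/2)t to step backward in time; the function name
`mutant_frequency_before` is a hypothetical label chosen for this illustration.

```python
import math


def mutant_frequency_before(q_t, s, t):
    """Frequency q_0 of the disfavored allele t generations earlier (additive case)."""
    p_t = 1.0 - q_t
    log_ratio_0 = math.log(p_t / q_t) - (s / 2.0) * t  # ln(p_0/q_0)
    ratio_0 = math.exp(log_ratio_0)                    # p_0/q_0
    return 1.0 / (1.0 + ratio_0)                       # q_0, since p_0 + q_0 = 1


if __name__ == "__main__":
    # Values from Problem 6.4: q_t = 0.008 in 1965, s = 0.20, one generation per year.
    print(f"1950: q_0 = {mutant_frequency_before(0.008, 0.20, 15):.3f}")
    print(f"1940: q_0 = {mutant_frequency_before(0.008, 0.20, 25):.3f}")
```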


Application  to  the  Evolution  of  Insecticide  Resistance 

Some of the most dramatic examples of evolution in action result from the
natural selection for chemical pesticide resistance in natural populations of
insects and other agricultural pests. In the 1940s, when chemical pesticides
were first used on a large scale, an estimated 7% of the agricultural crops in
the United States were lost to insects. Initial successes in chemical pest
management were followed by gradual loss of effectiveness. Today, more than 400
pest species have evolved significant resistance to one or more pesticides, and
13% of the agricultural crops in the United States are lost to insects (May
1985). In many cases, significant pesticide resistance has evolved in 5 to 50
generations irrespective of the insect species, geographical region, pesticide,
frequency and method of use, and other seemingly important variables (May
1985). Equations 6.15 through 6.17 help to explain this apparent paradox
because many of the resistance phenotypes result from single mutant alleles.
The resistance alleles are often partially or completely dominant, so
Equations 6.15 and 6.16 are applicable. Prior to use of the pesticide, the allele
frequency $p_0$ of the resistant mutant is generally close to 0. Use of the pesticide
increases the allele frequency, sometimes by many orders of magnitude, but
significant resistance is noticed in the pest population even before the allele
frequency $p_t$ increases above a few percent. Thus, as rough approximations,
we may assume that $q_0$ and $q_t$ are both close enough to 1 that
$\ln(p_0/q_0) \approx \ln p_0$ and $\ln(p_t/q_t) \approx \ln p_t$. Using these
approximations, Equation 6.16 (additive case) implies that
$t \approx (2/s)\ln(p_t/p_0)$ and Equation 6.15 (dominant case) implies that
$t \approx (1/s)\ln(p_t/p_0)$, because the $1/q$ terms very nearly cancel. In
many instances, the ratio $p_t/p_0$ may range from $1 \times 10^{2}$ to perhaps
$1 \times 10^{7}$, and s may typically be 0.5 or greater. Over this wide
range of parameter values, the time t is effectively limited to a range of 5 to 50
generations for the appearance of a significant degree of pesticide resistance.
Details in actual examples depend on such factors as effective population
number and extent of genetic isolation between local populations. An
example of the global spread of an insecticide-resistance allele is given in Chapter
8. The evolution of resistance caused by multiple interacting alleles may be
expected to take somewhat longer than single-gene resistance.


PROBLEM 6.5 In the discussion of the evolution of insecticide
resistance, we used the approximation $t \approx (1/s)\ln(p_t/p_0)$ for the
dominant case and $t \approx (2/s)\ln(p_t/p_0)$ for the semidominant case.
Evaluate the adequacy of the approximations for the values in the
accompanying table by comparing them with the more exact values
calculated from Equations 6.15 and 6.16.


Example no.    p_0           p_t     s
1              1 x 10^-4     0.01    0.50
2              1 x 10^-4     0.10    0.50
3              1 x 10^-4     0.50    0.50
4              1 x 10^-7     0.10    0.50
5              1 x 10^-4     0.10    0.20


ANSWER The approximations are quite acceptable for the examples.
The more exact and approximate values of t (in generations) are as follows:


               Dominant case                  Additive case
Example no.    Eqn. 6.15    Approximation     Eqn. 6.16    Approximation
1               9.3          9.2              18.5         18.4
2              14.2         13.8              28.1         27.6
3              20.4         17.0              36.8         34.1
4              28.1         27.6              55.7         55.3
5              35.6         34.5              70.1         69.1

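
The comparison in this answer can be reproduced with a short Python sketch. It
assumes that the dominant case integrates to ln(p/q) + 1/q increasing by s per
generation and that the additive case integrates to ln(p/q) increasing by s/2
per generation, and it reads the five examples as (p_0, p_t, s) triples. The
function names are hypothetical labels for this illustration.

```python
import math


def logit(p):
    """ln(p/q) with q = 1 - p."""
    return math.log(p / (1.0 - p))


def t_dominant(p0, pt, s):
    """Generations needed in the dominant case (integrated form assumed above)."""
    return ((logit(pt) + 1.0 / (1.0 - pt)) - (logit(p0) + 1.0 / (1.0 - p0))) / s


def t_additive(p0, pt, s):
    """Generations needed in the additive case (integrated form assumed above)."""
    return 2.0 * (logit(pt) - logit(p0)) / s


def t_approx(p0, pt, s, additive=False):
    """Rare-allele approximation (1/s) ln(pt/p0), doubled for additive alleles."""
    factor = 2.0 if additive else 1.0
    return factor * math.log(pt / p0) / s


# (p0, pt, s) for the five examples of Problem 6.5.
EXAMPLES = [
    (1e-4, 0.01, 0.50),
    (1e-4, 0.10, 0.50),
    (1e-4, 0.50, 0.50),
    (1e-7, 0.10, 0.50),
    (1e-4, 0.10, 0.20),
]

if __name__ == "__main__":
    for number, (p0, pt, s) in enumerate(EXAMPLES, start=1):
        print(number,
              round(t_dominant(p0, pt, s), 1), round(t_approx(p0, pt, s), 1),
              round(t_additive(p0, pt, s), 1), round(t_approx(p0, pt, s, True), 1))
```

The approximation is weakest in example 3, where p_t = 0.5 and the assumption
that q_t is close to 1 no longer holds.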

EQUILIBRIA  WITH  SELECTION 

An equilibrium value of p in a discrete model is any value for which $\Delta p = 0$.
When the allele frequency is at an equilibrium in an infinite population, the
allele frequency remains the same generation after generation. Because real
populations are finite in size, an allele frequency is subject to chance
fluctuations and so cannot usually remain exactly at an equilibrium value. For any
equilibrium, therefore, it is important to consider how the allele frequency
behaves when it is close, but not exactly equal, to the equilibrium value. Any
equilibrium can be classified as one of several different types according to the
behavior of the allele frequency when it is near the equilibrium:

• An equilibrium is said to be locally stable if the allele frequency, when it
is already close to the equilibrium, moves progressively closer in
subsequent generations. A locally stable equilibrium may also be globally
stable. This term means that the allele frequency always moves toward the
equilibrium regardless of where it starts, even if initially far away from
the equilibrium. A polymorphism with a stable equilibrium is sometimes
called a balanced polymorphism.

• An equilibrium is unstable if the allele frequency, initially close to the
equilibrium, moves progressively farther away in subsequent generations.

• An equilibrium is called neutrally stable or semistable if the allele
frequency has no tendency to change regardless of its initial value. In such a
case, every allele frequency represents an equilibrium because $\Delta p = 0$
whatever the value of p. This type of equilibrium is exemplified by the
Hardy-Weinberg principle in an infinite population (Chapter 3).

The concepts of stability can be applied to the case of selection governed
by Equation 6.11 in which A is the favored allele. For A to be favored, we
need $w_{11} \ge w_{12} \ge w_{22}$, and at least one of the inequalities must
be strict. In such a case, there are only two equilibria, namely p = 0 and
p = 1. Except at p = 0 and p = 1, where $\Delta p = 0$, it is always true that
$\Delta p > 0$. Hence, if p is close to 0, its value increases (moving it
farther away from 0), and so the equilibrium at p = 0 is unstable. On the other
hand, if p is near 1, it moves still closer to 1 (because $\Delta p > 0$), and
so the equilibrium at p = 1 is locally stable. In this example, p eventually
goes to 1 whatever its initial value, and so the equilibrium at p = 1 is
globally stable also.

Overdominance 

With  two  alleles  of  a gene  in  a diploid  organism,  there  is  the  possibility  that 
the  heterozygous  genotype  has  the  highest  fitness  or  that  the  heterozygous 
genotype  has  the  lowest  fitness.  These  cases  illustrate  equilibria  in  which  the 
equilibrium  value  of  p is  between  0 and  1. 

Overdominance, also called heterozygote superiority, is the term applied
when the heterozygote has a higher fitness than both homozygotes.
Symbolically, heterozygote superiority means that $w_{12} > w_{11}$ and
simultaneously $w_{12} > w_{22}$. With overdominance, p = 0 and p = 1 are both
equilibria because, according to Equation 6.11, $\Delta p = 0$ at these values.
There is also a third equilibrium made possible by the fact that
$p(w_{11} - w_{12}) + q(w_{12} - w_{22})$ can equal 0. The equilibrium frequency
of A is conventionally denoted $\hat{p}$; hence the equilibrium allele frequency
of a is $\hat{q} = 1 - \hat{p}$. The equilibrium can be found by solving
$\hat{p}(w_{11} - w_{12}) + \hat{q}(w_{12} - w_{22}) = 0$, from which a little
algebra gives

$$\hat{p} = \frac{w_{12} - w_{22}}{2w_{12} - w_{11} - w_{22}} \qquad (6.18)$$

Equation 6.18 is often encountered in another form in which the fitnesses
are all expressed relative to that of the heterozygote by setting
$w_{11} = 1 - s$, $w_{12} = 1$, and $w_{22} = 1 - t$. (This formulation is
proposed at the risk of some confusion because t is now the selection
coefficient against aa rather than the time in generations.) With these
substitutions, Equation 6.18 becomes

$$\hat{p} = \frac{t}{s + t}$$

This relationship makes a lot of intuitive sense because it implies that greater
selection against aa increases the equilibrium frequency $\hat{p}$ of A.
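
As a numerical check, the following Python sketch computes the interior
equilibrium as (w12 - w22)/(2 w12 - w11 - w22) and iterates the one-generation
change in p from two different starting frequencies. It assumes the standard
viability-selection form Δp = pq[p(w11 - w12) + q(w12 - w22)]/w̄ for Equation
6.11; the function names and the starting frequencies are illustrative
assumptions, and the fitnesses 0.9, 1, and 0.8 are those quoted for Figure 6.4.

```python
def mean_fitness(p, w11, w12, w22):
    """Average fitness in a random-mating population with allele frequency p."""
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def delta_p(p, w11, w12, w22):
    """One-generation change in p (form of Equation 6.11 assumed in the lead-in)."""
    q = 1.0 - p
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / mean_fitness(p, w11, w12, w22)


def interior_equilibrium(w11, w12, w22):
    """Interior equilibrium frequency of A."""
    return (w12 - w22) / (2.0 * w12 - w11 - w22)


def iterate(p0, w11, w12, w22, generations):
    """Apply the one-generation change repeatedly, starting from p0."""
    p = p0
    for _ in range(generations):
        p += delta_p(p, w11, w12, w22)
    return p


if __name__ == "__main__":
    fitnesses = (0.9, 1.0, 0.8)
    print("predicted equilibrium:", round(interior_equilibrium(*fitnesses), 3))
    for p0 in (0.05, 0.95):
        print("start", p0, "-> after 500 generations:",
              round(iterate(p0, *fitnesses, 500), 3))
```

With s = 0.1 and t = 0.2, the alternative form t/(s + t) gives the same value.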

The overdominance equilibrium in Equation 6.18 is globally stable
whereas those at p = 0 and p = 1 are unstable. The time course is indicated in
Figure 6.4A, where the arrowheads show the direction of change in allele
frequency. Figure 6.4B shows the change in $\bar{w}$ with overdominance. The
average fitness in the population is maximized at the stable equilibrium.
Maximization of average fitness is a frequent outcome of selection in
random-mating populations with constant fitnesses. There are, however, many
exceptions when mating is nonrandom, when the fitnesses are not constant, or
when there are interactions between alleles of different genes (Ewens 1979;
Curtsinger 1984). Note particularly that $\bar{w}$ is the average fitness in the
population, not the average fitness of the population. The relative
survivorships $w_{11}$, $w_{12}$, and $w_{22}$ are relevant only to the
differential mortality of the genotypes within a population at any given time.
The average of the relative survivorships is the average "fitness" $\bar{w}$ in
the population. However, $\bar{w}$ has no necessary relation to vernacular
meanings of "fitness" such as competitive ability, population size, production
of biomass, or evolutionary persistence (Haymer and Hartl 1982).


Figure 6.4 Selection when there is overdominance. (A) The allele frequencies
converge to an equilibrium value irrespective of the initial frequency. In this
example, $w_{11} = 0.9$, $w_{12} = 1$, and $w_{22} = 0.8$, and the equilibrium
frequency of the A allele, $\hat{p}$, is 0.667. (B) Average fitness $\bar{w}$
against p for the same example. Note that $\bar{w}$ is a maximum at equilibrium.

Although overdominance is one mechanism for the maintenance of
polymorphisms in natural populations, it has been documented in only a few cases.
The classic case is sickle-cell anemia in human beings, which is prevalent in
many populations at risk for the type of malaria caused by the mosquito-borne
protozoan parasite Plasmodium falciparum (Figure 6.5). The anemia is caused by
an allele S that codes for a variant form of the β chain of hemoglobin. In
persons of genotype SS, many red blood cells assume a curved, elongated shape
("sickling") and are removed from circulation. The result is a severe anemia as
well as pain and disability owing to the accumulation of defective cells in the
capillaries, joints, spleen, and other organs. In the absence of intensive medical
care, persons of genotype SS usually do not survive. The S allele is maintained
at a relatively high frequency because persons of genotype AS, in which A is the
nonmutant allele, have only a mild form of the anemia but are quite resistant to
malaria, perhaps because red blood cells infested with the parasite undergo
sickling and are removed from circulation. Homozygous AA people are not
anemic but, on the other hand, are the most sensitive to severe malaria. The result
of the offsetting sickle-cell anemia and malaria resistance is that the
heterozygotes have the highest fitness. In regions of Africa in which malaria is
common, the viabilities of AA, AS, and SS genotypes have been estimated as
$w_{11} = 0.9$, $w_{12} = 1$, and $w_{22} = 0.2$, respectively (Cavalli-Sforza
and Bodmer 1971; Templeton 1982). Substitution into Equation 6.18 leads to a
predicted equilibrium allele frequency for A of $\hat{p} = 0.89$. Consequently,
that of S is 0.11. This value is reasonably close to the average allele
frequency of 0.09 across West Africa, but there is considerable variation in
allele frequency among local populations.


Figure 6.5 The medium gray areas show the incidence of falciparum malaria
in Africa, the Middle East, and southern Europe in the 1920s before mosquito
control programs were implemented. The light gray areas are regions with a
high incidence of sickle-cell anemia. The extensive overlap in the distributions
(darkest shade) was an early indication that there might be some causal
connection. (After Cavalli-Sforza 1974.)


PROBLEM 6.6 Experimental populations of Drosophila pseudoobscura
were periodically treated with weak doses of the insecticide DDT.
One population was initially polymorphic for five different inversions
of the third chromosome. After 13 generations, three of the inversions
had essentially disappeared from the population. The two that
remained were Standard (ST) and Arrowhead (AR). Changes in
frequency of each inversion were monitored and, from the values for the
first nine generations, the relative fitnesses of ST/ST, ST/AR, and
AR/AR genotypes were estimated as 0.47, 1.0, and 0.62, respectively
(DuMouchel and Anderson 1968). Because the inversions undergo
almost no recombination, each type can be considered as an "allele."
What equilibrium frequency of ST is predicted? What equilibrium
value of $\bar{w}$ is predicted?


ANSWER From Equation 6.18, $\hat{p}$ = (1.0 - 0.62)/(2.0 - 0.47 - 0.62) =
0.42. (The observed value after 13 generations was 0.43.) The predicted
equilibrium value of $\bar{w}$, from Equation 6.8, equals
$0.42^2 \times 0.47 + 2 \times 0.42 \times 0.58 \times 1.0 + 0.58^2 \times 0.62 = 0.78$.


PROBLEM 6.7 Warfarin is a blood anticoagulant used for rat control
in World War II and afterward. Initially highly successful, the
effectiveness of the rodenticide gradually diminished owing to the
evolution of resistance among some target populations. Among Norway
rats in Great Britain, resistance results from an otherwise harmful
mutation R in a gene in which the normal nonresistant allele may be
denoted S. In the absence of warfarin, the relative fitnesses of SS, SR,
and RR genotypes have been estimated as 1.00, 0.77, and 0.46,
respectively. In the presence of warfarin, the relative fitnesses have been
estimated as 0.68, 1.00, and 0.37, respectively (May 1985). The reduced
fitness of the RR genotype appears to result from an excessive
requirement for vitamin K. Calculate the equilibrium frequency $\hat{q}$ of R
in the presence of warfarin. Noting that, in the absence of warfarin, R and S
are very nearly additive in their effects on fitness, estimate the
approximate number of generations required for the allele frequency
of R to decrease from $\hat{q}$ to 0.01 in the absence of the poison.


ANSWER From Equation 6.18, the equilibrium frequency $\hat{p}$ of S
equals (1.00 - 0.37)/(2 - 0.68 - 0.37) = 0.66, and so $\hat{q}$ of R = 0.34.
Setting $q_0 = 0.34$ and $q_t = 0.01$ in Equation 6.16, with
s = 1.00 - 0.46 = 0.54, yields t = 14.6 generations. (The approximation is very
good even though s is large; the exact value is 14 generations.)
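
Both parts of this answer can be checked with the Python sketch below. It uses
the interior-equilibrium formula of Equation 6.18, the additive-case relation
ln(p_t/q_t) = ln(p_0/q_0) + (s/2)t, and, for the "exact" value, a direct
iteration of the usual discrete viability-selection recursion with the
no-warfarin fitnesses. The recursion and all function names are assumptions of
this illustration rather than formulas quoted from the text.

```python
import math


def interior_equilibrium(w11, w12, w22):
    """Equilibrium frequency of the first allele under overdominance."""
    return (w12 - w22) / (2.0 * w12 - w11 - w22)


def generations_additive(q0, qt, s):
    """Additive-case estimate of the time for the disfavored allele to fall from q0 to qt."""
    def log_ratio(q):
        return math.log((1.0 - q) / q)  # ln(p/q)
    return 2.0 * (log_ratio(qt) - log_ratio(q0)) / s


def generations_discrete(q0, q_target, w_ss, w_sr, w_rr):
    """Count generations of the discrete recursion until q is at or below q_target."""
    q, t = q0, 0
    while q > q_target:
        p = 1.0 - q
        wbar = p * p * w_ss + 2.0 * p * q * w_sr + q * q * w_rr
        q = q * (p * w_sr + q * w_rr) / wbar
        t += 1
    return t


if __name__ == "__main__":
    p_hat = interior_equilibrium(0.68, 1.00, 0.37)   # with warfarin
    q0 = round(1.0 - p_hat, 2)                        # 0.34, as in the answer
    print("q-hat of R:", q0)
    print("continuous estimate:", round(generations_additive(q0, 0.01, 0.54), 1))
    print("discrete count:", generations_discrete(q0, 0.01, 1.00, 0.77, 0.46))
```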


Local  Stability 

Although the curves in Figure 6.4A indicate that the interior equilibrium is
locally stable when there is overdominance, an alternative approach is also
applicable to the analysis of local stability in models of much greater
complexity. It is based on the expression for $\Delta p$ in Equation 6.11. To
emphasize that $\Delta p$ is a function of p, we will write it as an explicit
function, $\Delta(p)$. The local stability of an equilibrium depends on the
behavior of $\Delta(p)$ for a value of p close to, but not equal to, the
equilibrium, as illustrated in Figure 6.6. It is convenient to write
$\Delta(p + \varepsilon)$ as the change in allele frequency when the starting
point is a small deviation, $\varepsilon$, from any allele frequency p. The
function $\Delta(p + \varepsilon)$ can be expanded term by term into an infinite
sum:

$$\Delta(p + \varepsilon) = \Delta(p) + \frac{d\Delta(p)}{dp}\,\varepsilon + \frac{d^2\Delta(p)}{dp^2}\,\frac{\varepsilon^2}{2!} + \frac{d^3\Delta(p)}{dp^3}\,\frac{\varepsilon^3}{3!} + \cdots$$

The mathematical basis of this type of expansion is beyond the scope of
the book. If you are unfamiliar with it and want to look it up, you will find it
under the heading the Taylor series in most textbooks of calculus. It is named
after the mathematician Brook Taylor (1685-1731).

The value of the Taylor series expansion is that, when $\varepsilon$ is
sufficiently small, then all terms in $\varepsilon^2$ and higher can be ignored.
Therefore, for any value of p, we can approximate $\Delta(p + \varepsilon)$ in
terms of $\Delta(p)$ itself and its first derivative. Furthermore, if p is one
of the equilibrium points, then $\Delta(p) = 0$ by definition, and so the sign
of $\Delta(p + \varepsilon)$ depends on the sign of the first derivative of
$\Delta(p)$ evaluated at the equilibrium in question. By definition, an
equilibrium is locally stable if the allele frequency, starting at a point near
the equilibrium, moves ever closer to the equilibrium. In symbols, this means
that $\Delta(\hat{p} + \varepsilon) < 0$ if $\varepsilon > 0$ and
$\Delta(\hat{p} + \varepsilon) > 0$ if $\varepsilon < 0$. Therefore, any
equilibrium point, denoted generically as $\hat{p}$, is locally stable if, and
only if,

$$\left.\frac{d\Delta(p)}{dp}\right|_{\hat{p}} < 0 \qquad (6.19)$$

where the vertical line and $\hat{p}$ mean that the derivative should be
evaluated at the equilibrium in question.


Figure 6.6 The change in allele frequency $\Delta p$ plotted as a function of
allele frequency p for a case of overdominance in which $w_{11} = 0.6$,
$w_{12} = 1$, and $w_{22} = 0.2$. Starting with an allele frequency $p_0$
smaller than the equilibrium value, the positive value of $\Delta p_0$ indicates
that the allele frequency in the next generation, $p_1$, will be greater than
$p_0$ because $p_1 = p_0 + \Delta p_0$. At an allele frequency of $p_1$, the
value of $\Delta p_1$ is also positive, and so $p_2$ is greater than $p_1$
because $p_2 = p_1 + \Delta p_1$. The steady increase continues until the
population arrives at the equilibrium point $\hat{p}$. The same logic shows
that, starting with an initial allele frequency greater than $\hat{p}$, the
allele frequency decreases in each succeeding generation and ultimately
converges to the equilibrium from the other side.

In practice, calculating the derivative of $\Delta(p)$ can be quite tedious
without the use of computer software like Mathematica to do the algebraic
manipulations. The result of differentiating Equation 6.11 is that

$$\frac{d\Delta(p)}{dp} = \frac{pqw}{\bar{w}} + \frac{(q - p)(p - \hat{p})w}{\bar{w}} - \frac{2pq(p - \hat{p})^2w^2}{\bar{w}^2}$$

where $w = w_{11} - 2w_{12} + w_{22}$. With overdominance, $w < 0$. Note that,
when $d\Delta(p)/dp$ is evaluated at p = 0 or p = 1, both the first and last
terms equal 0; when it is evaluated at $p = \hat{p}$, the second and last terms
equal 0. The stability analysis proceeds as follows:

• At p = 0, sign $[d\Delta(p)/dp]$ = -sign $(w)$ > 0;

• At $p = \hat{p}$, sign $[d\Delta(p)/dp]$ = sign $(w)$ < 0;

• At p = 1, sign $[d\Delta(p)/dp]$ = -sign $(w)$ > 0.

Therefore, as is already clear from Figure 6.4A, the equilibrium points at
0, $\hat{p}$, and 1 are unstable, locally stable, and unstable, respectively.
This stability analysis is predicated on the assumption of heterozygote
superiority, which implies that $w < 0$. Exactly the same equilibrium points are
present when there is heterozygote inferiority, but then $w > 0$, which means
that the stability property of each equilibrium point is reversed. This
situation is discussed next.
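
The sign test can also be carried out numerically instead of algebraically. The
Python sketch below estimates the slope of Δ(p) at each equilibrium by a
central difference and classifies the equilibrium by the sign of that slope. It
assumes the standard form Δp = pq[p(w11 - w12) + q(w12 - w22)]/w̄ for Equation
6.11; the function names and the step size `eps` are illustrative choices.

```python
def delta(p, w11, w12, w22):
    """One-generation change in allele frequency (form of Eq. 6.11 assumed above)."""
    q = 1.0 - p
    wbar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / wbar


def slope(p, w11, w12, w22, eps=1e-6):
    """Central-difference estimate of d(Delta)/dp at p."""
    return (delta(p + eps, w11, w12, w22) - delta(p - eps, w11, w12, w22)) / (2.0 * eps)


def classify(p_eq, w11, w12, w22):
    """Apply the sign criterion of Equation 6.19 at the equilibrium p_eq."""
    d = slope(p_eq, w11, w12, w22)
    if d < 0:
        return "locally stable"
    if d > 0:
        return "unstable"
    return "neutral"


if __name__ == "__main__":
    over = (0.6, 1.0, 0.2)    # overdominance, fitnesses of Figure 6.6
    under = (1.0, 0.8, 0.9)   # heterozygote inferiority, fitnesses of Figure 6.7
    for name, fit in (("overdominance", over), ("heterozygote inferiority", under)):
        p_hat = (fit[1] - fit[2]) / (2.0 * fit[1] - fit[0] - fit[2])
        for p_eq in (0.0, p_hat, 1.0):
            print(name, round(p_eq, 3), classify(p_eq, *fit))
```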


Heterozygote  Inferiority 

Heterozygote inferiority means that the fitness of the heterozygous
genotype is smaller than that of both homozygotes: $w_{12} < w_{11}$ and
$w_{12} < w_{22}$. An interior equilibrium, given by Equation 6.18, exists in
this case also. The analysis in the previous section indicates that this
equilibrium is unstable, whereas the equilibria at p = 0 and p = 1 are both
locally (but not globally) stable. An example of heterozygote inferiority is
depicted in Figure 6.7A, where the arrows again denote the direction of change
in allele frequency. If the initial allele frequency is exactly equal to the
equilibrium value (in this example, $\hat{p} = 1/3$), then the allele frequency
remains at that value. In all other cases, p goes to 1 or 0 depending on whether
the initial allele frequency was above or below the equilibrium value.


Figure 6.7 Selection when there is heterozygote inferiority. (A) The allele
frequency goes to 0 or 1 depending on the initial frequency. In this example,
$w_{11} = 1$, $w_{12} = 0.8$, and $w_{22} = 0.9$, and there is an unstable
equilibrium when the frequency of the A allele is $\hat{p} = 0.333$. An infinite
population with p = 1/3 maintains this frequency, but any slight upward change
in the frequency of A results in eventual fixation, and any slight downward
change in the frequency of A results in ultimate loss. (B) Average fitness
$\bar{w}$ against p for the same example. The unstable equilibrium represents
the minimum of $\bar{w}$.


Figure 6.7B shows the change in average fitness. The unstable equilibrium
at $\hat{p} = 1/3$ is the minimum average fitness. The shape of the $\bar{w}$
curve has an important implication that carries over to more complex examples.
Imagine a population with an allele frequency near 0, at which $\bar{w} = 0.9$.
In terms of average fitness in the population, the population would be better
off if the allele frequency were near 1, because then $\bar{w} = 1.0$. However,
as shown by the direction of the arrows, the population cannot evolve toward
p = 1. It cannot get through the "valley" because p = 0 is a locally stable
equilibrium. The population has no way to escape from the equilibrium even
though, in doing so, it would eventually end up with a greater average fitness.
This consideration would seem to limit the ability of natural selection to
increase average fitness in such cases, but one way out of the impasse is
suggested in the next section.
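
The "valley" can be seen in a small deterministic calculation. The Python
sketch below iterates the usual one-locus viability-selection recursion (an
assumption of this illustration) with the fitnesses of the Figure 6.7 example,
starting just below and just above the unstable equilibrium at 1/3, and also
evaluates the average fitness at p = 0, 1/3, and 1. The starting values 0.32 and
0.34 and the function names are hypothetical.

```python
def mean_fitness(p, w11, w12, w22):
    """Average fitness at allele frequency p under random mating."""
    q = 1.0 - p
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def next_p(p, w11, w12, w22):
    """Allele frequency after one generation of viability selection."""
    q = 1.0 - p
    return p * (p * w11 + q * w12) / mean_fitness(p, w11, w12, w22)


def run(p0, fitnesses, generations):
    """Iterate the recursion for a fixed number of generations."""
    p = p0
    for _ in range(generations):
        p = next_p(p, *fitnesses)
    return p


FITNESSES = (1.0, 0.8, 0.9)  # w11, w12, w22 from the Figure 6.7 example

if __name__ == "__main__":
    for p0 in (0.32, 0.34):
        print("start", p0, "->", round(run(p0, FITNESSES, 500), 6))
    for p in (0.0, 1.0 / 3.0, 1.0):
        print("mean fitness at p =", round(p, 3), "is",
              round(mean_fitness(p, *FITNESSES), 4))
```

A population that starts below 1/3 ends at p = 0, where the average fitness is
0.9, even though p = 1 would give an average fitness of 1.0.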

The  Adaptive  Topography  and  the  Role  of  Random  Genetic  Drift 

Any graph of $\bar{w}$ against allele frequency is called an adaptive
topography. The simplest example is Figure 6.7B. In order to generalize the
example, try to imagine an adaptive topography in many dimensions with
$\bar{w}$ a function of the allele frequencies at many loci. In many dimensions,
the adaptive topography is a complex surface upon which there may be "peaks"
and "pits" and even "saddle-shaped" regions. The peaks represent locally stable
equilibria. Even if natural selection changes the allele frequencies so as to
move $\bar{w}$ to the top of some peak, the peak it perches on may not be the
highest peak that exists on the whole surface. However, as illustrated in Figure
6.7B, the population may become stuck there because the peak is a locally
stable equilibrium.

By what process can a population stranded on a submaximal fitness peak
get off the peak? To do so, it has to travel through a nearby valley to a place
where natural selection can carry it to the top of an even higher fitness peak.
This is something that natural selection acting alone cannot accomplish because
it entails a temporary reduction in fitness. There is, however, a process that can
accomplish the task: random genetic drift. In a sufficiently small population,
the allele frequencies can change by chance, even producing a reduction in
average fitness. Theoretically, random genetic drift can shift a population from a
locally stable equilibrium, through a nearby valley, and into a region where it is
attracted by another locally stable equilibrium toward a higher fitness peak.
Random genetic drift can therefore play a crucial role in evolution by allowing a
population to explore the full range of its adaptive topography. This role of
random genetic drift has been particularly emphasized by Wright (1977 and
earlier) in his proposed shifting balance theory of evolution. Additional discussion
of the theory is found in this chapter's section on interdemic selection; see also
Hartl (1979), Provine (1986), and Coyne et al. (1997).
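
A rough simulation can illustrate how chance makes the crossing possible. The
Python sketch below is not a model given in the text: it assumes a
Wright-Fisher-style scheme in which each generation applies viability selection
and then draws 2N genes at random, and it uses the heterozygote-inferiority
fitnesses of the Figure 6.7 example. The population size (2N = 40), the starting
count (8 copies, p = 0.2, below the unstable equilibrium at 1/3), the number of
replicates, and the seed are all arbitrary illustrative choices.

```python
import random


def selected_frequency(p, w11, w12, w22):
    """Allele frequency after viability selection, before sampling."""
    q = 1.0 - p
    wbar = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * (p * w11 + q * w12) / wbar


def fixes(count0, n_genes, fitnesses, rng):
    """Follow one finite population until A is lost or fixed; True if A is fixed."""
    count = count0
    while 0 < count < n_genes:
        p_sel = selected_frequency(count / n_genes, *fitnesses)
        # Binomial sampling of n_genes gametes: the source of random drift.
        count = sum(1 for _ in range(n_genes) if rng.random() < p_sel)
    return count == n_genes


def fraction_fixed(count0, n_genes, fitnesses, replicates, seed=1):
    """Fraction of replicate populations in which A reaches fixation."""
    rng = random.Random(seed)
    wins = sum(fixes(count0, n_genes, fitnesses, rng) for _ in range(replicates))
    return wins / replicates


if __name__ == "__main__":
    fitnesses = (1.0, 0.8, 0.9)
    print("fraction of small populations fixing A:",
          fraction_fixed(8, 40, fitnesses, 300))
```

In an infinite population starting at p = 0.2, selection alone would always
carry A to loss; in the small simulated populations some replicates are expected
to drift across the valley and fix A.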

MUTATION-SELECTION  BALANCE 

You may recall from Chapter 4 that outcrossing species typically contain a
large amount of hidden genetic variability in the form of recessive, or nearly
recessive, harmful alleles, each present at a low frequency. Now we can
explain why harmful alleles are not completely eliminated. Selection cannot
eliminate them because they are continually created anew through recurrent
mutation. To be specific, suppose that a is a harmful allele of the wildtype A
and that mutation of A to a takes place at the rate $\mu$ per generation. Because
the allele frequency of a, which we call q, remains small, reverse mutation of
a to A can safely be ignored. The calculation of $p'$ carried out to obtain
Equation 6.10 is still valid, except that a proportion $\mu$ of A alleles mutate
to a in each generation. Therefore,

$$p' = \frac{p(pw_{11} + qw_{12})}{\bar{w}}\,(1 - \mu) \qquad (6.20)$$

To proceed further, it is convenient to write the relative fitnesses as

$$w_{11} = 1 \qquad w_{12} = 1 - hs \qquad w_{22} = 1 - s$$


The value of s is the selection coefficient against the homozygous aa
genotypes and h is the degree of dominance of the a allele. If h = 0, then a is a
complete recessive because AA and Aa have an identical fitness. If h = 1, then a is
dominant because Aa and aa have an identical fitness. Semidominance means
that h = 1/2. In mutation-selection balance, we are concerned with harmful
alleles that are near the recessive end of the spectrum, and so h will usually
be substantially smaller than 0.5.

Equilibrium  Allele  Frequencies 

When selection is balanced by recurrent mutation, there is a globally stable
equilibrium at an allele frequency of $\hat{p}$, which is the value of p in
Equation 6.20 for which $p' = p$. The equilibrium frequency of the harmful a
allele is therefore $\hat{q} = 1 - \hat{p}$. There are two important cases:

• When the harmful allele is a complete recessive (h = 0), then

$$\hat{q} = \sqrt{\frac{\mu}{s}} \qquad (6.21)$$

• When the harmful allele shows partial dominance (h > 0), then, to an
excellent approximation for realistic values of $\mu$, h, and s,

$$\hat{q} \approx \frac{\mu}{hs} \qquad (6.22)$$


Use of these equations is exemplified by Huntington disease in human
beings. This severe inherited disorder is characterized by a degeneration of
the neuromuscular system that typically appears after age 35. Although the
disease itself results from a dominant mutation, the effects on fitness show
only partial dominance owing to the late age of onset of the disease. Relative
to a value of $w_{11} = 1$ for the homozygous nonmutant genotype, the fitness of
the heterozygous genotype has been estimated as $w_{12} = 0.81$ (Reed and Neel
1959). Homozygous mutant genotypes also have the disease, but they are so
rare that the equilibrium frequency of the mutant allele is determined by the
fitness of the heterozygote. Equation 6.22 with hs = 0.19 is appropriate in this
example. If we knew either $\mu$ or $\hat{q}$, we could estimate the other. In a Michigan
population, $q = 5 \times 10^{-5}$ for the Huntington allele (Reed and Neel 1959).
Assuming that the population is in equilibrium, we can estimate $\mu$ from
Equation 6.22 as $\mu = 5 \times 10^{-5} \times 0.19 = 9.5 \times 10^{-6}$. This use of Equation 6.22
illustrates one of the common indirect methods for the estimation of mutation
rates in human beings.
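
The arithmetic in the Huntington example can be checked with a few lines of Python. This is only a sketch: the function name is an illustrative choice, and it assumes, as the text does, that the population is at equilibrium and that the partial-dominance approximation of Equation 6.22 (equilibrium frequency roughly equal to the mutation rate divided by hs) applies.

```python
def mutation_rate_from_equilibrium(q_hat, hs):
    """Invert q_hat ~ mu / (h*s) to estimate the mutation rate mu."""
    return q_hat * hs

w12 = 0.81        # heterozygote fitness quoted in the text (Reed and Neel 1959)
hs = 1.0 - w12    # selection against heterozygotes, because w12 = 1 - hs
q_hat = 5e-5      # allele frequency reported for the Michigan population

mu = mutation_rate_from_equilibrium(q_hat, hs)
print(f"hs = {hs:.2f}, estimated mu = {mu:.2e}")
```

The same function run in reverse (dividing a known mutation rate by hs) would give the expected equilibrium frequency, which is the other direction of inference mentioned in the text.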

The degree of dominance of a harmful allele is a primary factor in determining
its equilibrium frequency. Harmful alleles held in mutation-selection
balance are rare. Thus the great majority of harmful alleles are present in
heterozygous genotypes. Because there are so many heterozygous genotypes,
relative to homozygous mutant genotypes, even a small reduction in
fitness in the heterozygote has a large effect in decreasing the equilibrium
allele frequency. This effect is shown quantitatively in Figure 6.8, which
depicts $\hat{q}$ as a function of $\mu/s$ and h. Note how the surface bends sharply
upward at the far-right corner where h = 0. The increase indicates that, for a
given value of $\mu/s$, a completely recessive allele is maintained at a higher
equilibrium frequency than a partially dominant allele. Furthermore, the
surface drops sharply as h increases from 0, which means that even a small
degree of dominance can cause a large reduction in equilibrium frequency.
In general, for realistic values of $\mu$, s, and h, the value of $\hat{q}$ is typically less
than 0.01. Therefore, although mutation-selection balance can account for
low-frequency deleterious alleles, it cannot readily account for a harmful
allele with a frequency greater than 0.01.

Figure 6.8 Allele frequencies maintained at equilibrium by mutation-selection
balance. At each point on the surface, the height is the equilibrium frequency
$\hat{q}$ of a harmful allele, given as a function of the mutation rate $\mu$ (expressed in
multiples of the selection coefficient s) and the degree of dominance h. Note that
the surface bends sharply upward toward h = 0, a characteristic that means that
even a small degree of dominance results in a substantial decrease in the
equilibrium frequency of the harmful allele. The $\mu/s$ axis is easiest to interpret when
the harmful allele is a lethal (s = 1).


PROBLEM 6.8 To confirm for yourself that a small amount of dominance
can have a major effect in reducing the equilibrium frequency
of a harmful allele, imagine an allele that is lethal when homozygous
(s = 1) in a population of Drosophila. Suppose that the allele is maintained
by mutation-selection balance with $\mu = 5 \times 10^{-6}$. Calculate the
equilibrium frequency of the allele for a complete recessive and for
partial dominance when h = 0.025.


ANSWER For a complete recessive, $\hat{q} = \sqrt{\mu/s} = \sqrt{5 \times 10^{-6}} = 2.24 \times 10^{-3}$.
For partial dominance, $\hat{q} = \mu/hs = (5 \times 10^{-6})/0.025 = 2.00 \times 10^{-4}$. With
partial dominance, the equilibrium allele frequency is reduced more
than tenfold, and the frequency of homozygous recessive genotypes
at equilibrium is reduced more than a hundredfold. It is of interest
that h = 0.025 is near the average degree of dominance estimated for
"recessive" lethals in Drosophila (Simmons and Crow 1977).

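As a numerical check on this answer, the sketch below iterates the mutation-selection recursion until the allele frequency stops changing and prints the result next to the two formulas. It assumes the recursion p' = (p²w11 + pq·w12)(1 − μ)/w̄ with w11 = 1, w12 = 1 − hs, and w22 = 1 − s; the function names, the starting frequency, and the number of generations are illustrative choices, not part of the original problem.

```python
import math

def next_p(p, mu, h, s):
    """One generation of viability selection followed by A -> a mutation."""
    q = 1.0 - p
    w11, w12, w22 = 1.0, 1.0 - h * s, 1.0 - s
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) * (1.0 - mu) / w_bar

def equilibrium_q(mu, h, s, p0=0.99, generations=20_000):
    """Iterate the recursion and return the final frequency of allele a."""
    p = p0
    for _ in range(generations):
        p = next_p(p, mu, h, s)
    return 1.0 - p

mu, s = 5e-6, 1.0
q_recessive = equilibrium_q(mu, h=0.0, s=s)
q_partial = equilibrium_q(mu, h=0.025, s=s)

print(f"h = 0:     iterated {q_recessive:.3e}, sqrt(mu/s) = {math.sqrt(mu / s):.3e}")
print(f"h = 0.025: iterated {q_partial:.3e}, mu/(hs)    = {mu / (0.025 * s):.3e}")
print(f"ratio of allele frequencies: {q_recessive / q_partial:.1f}")
```

The iterated value for h = 0.025 falls slightly below μ/hs, which is expected because the partial-dominance formula is an approximation that neglects selection against the rare homozygotes.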

The  Haldane-Muller  Principle 

The Haldane-Muller principle, named after the geneticists J. B. S. Haldane
(1892-1964) and H. J. Muller (1890-1967), deals with the effect of
mutation-selection balance on the average fitness of a population. Ignoring recurrent
mutation, selection would be able to rid a population completely of a harmful
allele. Then, $\hat{q} = 0$, and $\bar{w} = 1$. Because of recurrent mutation, the equilibrium
frequency is greater than 0. When h = 0, the average fitness in the
population at equilibrium equals $1 - \hat{q}^2 s = 1 - (\mu/s)s = 1 - \mu$. The reduction in
average fitness due to mutation therefore equals $1 - (1 - \mu) = \mu$, which is
called the mutation load. When a is partially dominant, the mutation load is
approximately $2\mu$ because the average fitness at equilibrium is
$1 - 2\hat{p}\hat{q}hs - \hat{q}^2 s \approx 1 - 2\mu$. This result is obtained by substituting $\hat{q} = \mu/hs$
with $\hat{p} \approx 1$ and ignoring terms in $\hat{q}^2$ because they are so
small. With or without partial dominance, therefore, the effect of recurrent
mutation in reducing the average fitness in the population is independent
of how harmful the mutation is. That the effect of recurrent mutation on
average population fitness depends only on the mutation rate is the
Haldane-Muller principle. The implication is that the harmful effect of an increase in
the mutation rate is the same irrespective of whether the mutations produced
are mildly detrimental or severely harmful. The effects of severe and mild
mutations balance out because a more harmful mutation comes to a lower
equilibrium frequency.
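
The independence of the mutation load from the selection coefficient can be seen by direct calculation. The sketch below is illustrative only: it assumes the equilibrium frequencies given in the text (the square root of μ/s for a complete recessive, and approximately μ/hs with partial dominance), and the function names and parameter values are arbitrary choices.

```python
import math

def load_recessive(mu, s):
    """Mutation load 1 - w_bar for a complete recessive (h = 0)."""
    q_hat = math.sqrt(mu / s)
    return q_hat ** 2 * s

def load_partial_dominance(mu, h, s):
    """Mutation load 1 - w_bar using the approximation q_hat = mu / (h*s)."""
    q_hat = mu / (h * s)
    p_hat = 1.0 - q_hat
    return 2 * p_hat * q_hat * h * s + q_hat ** 2 * s

mu = 1e-5
for s in (0.01, 0.1, 1.0):
    rec = load_recessive(mu, s)
    dom = load_partial_dominance(mu, h=0.25, s=s)
    print(f"s = {s:<5} load (h = 0) = {rec:.3e}   load (h = 0.25) = {dom:.3e}")
```

Across a hundredfold range of s, the recessive load stays at μ and the partially dominant load stays close to 2μ, which is the content of the principle.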


MORE  COMPLEX  TYPES  OF  SELECTION 

Although  the  two-allele  model  of  viability  selection  illustrates  the  possible 
outcomes  of  selection,  it  ignores  many  potential  complications.  For  example, 
when  the  genotypes  differ  in  fertility  rather  than  survivorship,  then  the 
model  of  viability  selection  is  inadequate  except  in  special  cases.  Most  muta- 
tions have  pleiotropic  effects;  that  is,  they  affect  more  than  one  phenotypic 
attribute  of  the  organism.  For  example,  a gene  affecting  embryonic  growth 
rate  may  also  affect  age  at  first  reproduction.  When  the  pleiotropic  effects  act 
in opposing directions (for example, increasing viability but reducing fertility),
the net effect on fitness may be quite small. As a result, mutations with
offsetting  effects  on  different  components  of  fitness  may  remain  segregating 
in  a population  for  many  generations. 

Additional  complications  arise  because  fitness  is  determined  by  many 
genes that interact with each other. Simple models of selection are valid only
when  the  alleles  interact  in  such  a way  that  their  effects  on  fitness  are  addi- 
tive or  multiplicative  across  genes.  Other  complications  result  when  the  fit- 
nesses of  the  genotypes  are  not  constant  but  variable  in  time  or  space.  In  this 
section  we  briefly  examine  a sample  of  more  complex  models.  Many  of  the 
models  are  of  interest  because  they  can  maintain  genetic  polymorphisms. 
Although  the  list  is  extensive,  it  is  by  no  means  complete.  You  should  not  try 
to  memorize  all  the  different  types  of  selection.  They  are  collected  here  only 
for  ease  of  reference. 

Frequency-Dependent  Selection 

Frequency-dependent  selection  takes  place  when  fitness  is  a function  of 
either  allele  frequencies  or  genotype  frequencies.  There  is  no  restriction  on 
the  type  of  frequency  dependence  except  that  each  Darwinian  fitness  must 
be  nonnegative.  A simple  example  that  illustrates  frequency  dependence  is 
one in which the fitness of each genotype decreases in proportion to its
frequency with a constant of proportionality equal to c:

$$AA:\ w_{11} = 1 - cp^2 \qquad Aa:\ w_{12} = 1 - 2cpq \qquad aa:\ w_{22} = 1 - cq^2$$

In this example, $\Delta p = cpq(q - p)(p^2 - pq + q^2)/\bar{w}$, and so there are equilibria
at $\hat{p}$ = 0, 1/2, and 1. (The factor $p^2 - pq + q^2$ does not have a root for p in the
range [0, 1].) A curious feature of this type of frequency-dependent selection
is that, at equilibrium, $w_{12}$ is smaller than either $w_{11}$ or $w_{22}$, so there is
heterozygote inferiority; yet $\hat{p} = 1/2$ is a globally stable equilibrium and $\bar{w}$ is a
maximum at this equilibrium. The peculiarities of this example are illustrative
of frequency-dependent selection in general. Because the fitnesses can be
any functions of allele or genotype frequency, nearly anything can happen.
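
The behavior of this example is easy to explore numerically. The following sketch assumes the fitnesses w11 = 1 − cp², w12 = 1 − 2cpq, and w22 = 1 − cq² and the standard viability recursion p' = (p²w11 + pq·w12)/w̄; the value c = 0.5, the starting frequencies, and the function names are illustrative assumptions.

```python
def fitnesses(p, c):
    """Frequency-dependent fitnesses of AA, Aa, and aa at allele frequency p."""
    q = 1.0 - p
    return 1 - c * p * p, 1 - 2 * c * p * q, 1 - c * q * q

def next_p(p, c):
    """One generation of selection with fitnesses recomputed from current p."""
    q = 1.0 - p
    w11, w12, w22 = fitnesses(p, c)
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + p * q * w12) / w_bar

def iterate(p0, c, generations):
    p = p0
    for _ in range(generations):
        p = next_p(p, c)
    return p

c = 0.5
for p0 in (0.05, 0.95):
    print(f"start p = {p0:.2f} -> after 2000 generations p = {iterate(p0, c, 2000):.6f}")

w11, w12, w22 = fitnesses(0.5, c)
print(f"fitnesses at p = 1/2: w11 = {w11}, w12 = {w12}, w22 = {w22}")
```

Both starting points move toward p = 1/2 even though the heterozygote has the lowest fitness there, which is the peculiarity the text points out.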

Density-Dependent  Selection 

Density-dependent  selection  means  that  the  fitnesses  are  functions  of  the 
population  size.  Models  of  density-dependent  selection  must  explicitly 
include population size and population growth. With logistic growth of two
haploid genotypes whose numbers at time t are A(t) and B(t), Equation 1.11
in Chapter 1 becomes

$$\frac{dA(t)}{dt} = r_1 A(t)\left[\frac{K_1 - A(t) - B(t)}{K_1}\right] \qquad \frac{dB(t)}{dt} = r_2 B(t)\left[\frac{K_2 - A(t) - B(t)}{K_2}\right]$$

Each genotype has its own intrinsic rate of increase ($r_1$ or $r_2$) and its own
carrying capacity ($K_1$ or $K_2$), but they affect each other's growth through the
total population size A(t) + B(t). At any time, the outcome of selection
depends on the total population size. When the population size is much
smaller than either $K_1$ or $K_2$, then the right-hand factor in each growth equation
equals approximately 1, and so the selection is determined by the relative
values of $r_1$ and $r_2$. When the population size becomes approximately
equal to the smaller of $K_1$ or $K_2$, then the genotype with the smaller carrying
capacity stops growing while the other continues, and so the selection is
determined by the relative values of $K_1$ and $K_2$. Interesting events happen
when the selection for r favors one genotype and the selection for K favors
the other, especially in situations in which stochastic factors also affect
population size or there is a time lag between population size and its effect on
growth rate. For further information on these types of models, see Roughgarden
(1979), May (1981), Bulmer (1994), and Cohen (1995).
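
The shift from selection on r to selection on K can be illustrated with a small simulation. The sketch assumes that the two growth equations have the continuous-time logistic form dA/dt = r1·A·(K1 − A − B)/K1 and dB/dt = r2·B·(K2 − A − B)/K2, and it integrates them with simple Euler steps. The parameter values (genotype A growing faster, genotype B having the larger carrying capacity) and the function name are hypothetical.

```python
def simulate(a0, b0, r1, r2, k1, k2, t_end, dt=0.01):
    """Euler integration of two haploid genotypes under logistic growth."""
    a, b = a0, b0
    for _ in range(int(round(t_end / dt))):
        n = a + b                      # total population size
        da = r1 * a * (k1 - n) / k1    # growth of genotype A
        db = r2 * b * (k2 - n) / k2    # growth of genotype B
        a, b = a + da * dt, b + db * dt
    return a, b

params = dict(a0=1.0, b0=1.0, r1=1.0, r2=0.5, k1=1000.0, k2=1500.0)

a_early, b_early = simulate(t_end=5.0, **params)
a_late, b_late = simulate(t_end=100.0, **params)

frac_early = a_early / (a_early + b_early)
frac_late = a_late / (a_late + b_late)
print(f"fraction of A at t = 5:   {frac_early:.3f}")
print(f"fraction of A at t = 100: {frac_late:.2e}")
```

While the population is small, the genotype with the larger r increases in relative frequency; once the population is crowded, the genotype with the larger K takes over.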

Fecundity  Selection 

In  fecundity  selection,  differences  in  fitness  between  the  genotypes  result 
from  the  differing  abilities  of  mating  pairs  to  produce  offspring.  Because  both 
genotypes  in  a mating  pair  contribute  to  the  total  number  of  offspring,  the 
number  of  fitness  parameters  potentially  equals  the  number  of  distinct  kinds 
of  mating  pairs.  For  two  alleles  of  one  gene,  there  are  nine  possible  types  of 
mating because reciprocal matings may differ in the expected number of offspring;
for example, the expected number of offspring from the mating
Aa ♀ × aa ♂ may differ from that from the mating Aa ♂ × aa ♀. The presence
of  so  many  fitness  parameters  complicates  the  mathematical  analysis.  An 
analysis  of  selection  based  on  individual  genotypes,  analogous  to  viability 
differences, is not possible unless the overall fecundity of any mating pair can
be written as either the product or the sum of two parameters, one for each
genotype  in  the  mating  pair.  When  this  strong  simplification  does  not  hold, 
models  of  selection  with  fertility  differences  become  rather  complex  (Ewens, 
1979;  Clark  and  Feldman  1986).  Models  in  which  differences  in  fecundity  are 
combined  with  differences  in  survivorship  can  retain  genetic  polymorphisms 
even  if  there  is  directional  selection  in  one  or  the  other  component  of  fitness. 

Age-Structured  Populations 

Age-structured  populations  with  overlapping  generations  present  problems 
even  more  formidable  than  those  caused  by  fecundity  and  survivorship  dif- 
ferences in  populations  with  discrete,  nonoverlapping  generations.  In  each 
short  interval  of  time,  a new  cohort  of  newborns  comes  into  existence  and, 
as  it  ages,  the  fate  of  each  organism  in  the  cohort  is  governed  by  the  functions 
l(x), which is the probability of survival from birth to age x, and b(x), which is
the  probability  that  an  organism  of  age  x (actually  in  the  infinitesimal  age 
interval  x to  x + dx)  reproduces.  If  the  functions  l(x)  and  b(x)  maintain  the 
same  form  over  time,  then  it  can  be  shown  that  the  population  eventually 
reaches  a stable  age  distribution  in  which  the  number  of  organisms  in  each 
age group increases or decreases at a constant rate. At the stable age distribution,
the overall growth rate of the population is the value of m that satisfies
the equation:

$$\int_0^\infty e^{-mx}\, l(x)\, b(x)\, dx = 1$$

(See  Crow  and  Kimura,  1970,  for  a derivation.)  For  this  value  of  m, 
dN/dt  = mN,  where  N is  the  total  population  size.  In  an  age-structured  popu- 
lation, m corresponds  to  the  intrinsic  rate  of  increase  denoted  r0  in  Equation 
1.7  in  Chapter  1. 
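
In practice the growth rate m usually has to be found numerically. The sketch below assumes the defining equation is the standard one from demography, in which the integral of exp(−mx)·l(x)·b(x) over all ages equals 1, and solves it by bisection with a trapezoid-rule integral. The survival and birth functions are hypothetical (constant death rate 0.2 and constant birth rate 0.5), chosen because the exact answer is then m = 0.5 − 0.2 = 0.3.

```python
import math

def euler_lotka_integral(m, survival, birth, x_max=100.0, steps=5000):
    """Trapezoid-rule estimate of the integral of exp(-m x) l(x) b(x) on [0, x_max]."""
    dx = x_max / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-m * x) * survival(x) * birth(x)
    return total * dx

def solve_growth_rate(survival, birth, lo=-0.1, hi=2.0, iterations=50):
    """Bisection on m; the integral decreases as m increases."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if euler_lotka_integral(mid, survival, birth) > 1.0:
            lo = mid   # integral too large, so m must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

DEATH_RATE = 0.2   # hypothetical constant death rate
BIRTH_RATE = 0.5   # hypothetical constant birth rate

def survival(x):
    return math.exp(-DEATH_RATE * x)   # l(x)

def birth(x):
    return BIRTH_RATE                  # b(x)

m = solve_growth_rate(survival, birth)
print(f"numerical m = {m:.4f}, analytic value = {BIRTH_RATE - DEATH_RATE:.4f}")
```

The bracketing interval and the truncation of the integral at x_max are assumptions that suit this example; other l(x) and b(x) functions may need different limits.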

So  far  so  good,  but  genetics  complicates  this  situation  enormously.  If  the 
l(x)  and  b(x)  functions  differ  for  different  genotypes,  then  the  allele  frequen- 
cies change  through  time.  As  the  allele  frequencies  change,  so  does  the  age 
structure,  and  the  genotype  frequencies  in  each  age  class  may  be  different. 
The  result  is  that  the  age  structure  may  not  become  stable  until  selection 
reaches  some  equilibrium  (possibly  fixation).  The  sorts  of  complexities  that 
can  arise  have  been  examined  by  Charlesworth  (1980). 

Heterogeneous  Environments  and  Clines 

Heterogeneous  environments  refer  to  models  in  which  the  relative  fitnesses 
change  according  to  the  environment.  The  environmental  heterogeneity  may 
be spatial or temporal or both. Selection of this type can maintain polymorphisms
in the absence of overdominance. If each homozygous genotype is
favored in a different subset of environments, then there can be marginal
overdominance, in which the heterozygous genotype has the highest fitness
when  averaged  across  all  the  environments,  even  though  it  is  not  the  most  fit 
genotype  in  any  particular  environment. 
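
A small numerical example makes marginal overdominance concrete. The fitness values and the equal weighting of the two environments below are hypothetical, and the arithmetic mean is used purely for illustration; which kind of average is appropriate depends on how the environmental heterogeneity is modeled.

```python
# Hypothetical relative fitnesses of each genotype in two environments.
fitness = {
    "env1": {"AA": 1.0, "Aa": 0.9, "aa": 0.7},
    "env2": {"AA": 0.7, "Aa": 0.9, "aa": 1.0},
}
weights = {"env1": 0.5, "env2": 0.5}   # assumed frequency of each environment

genotypes = ("AA", "Aa", "aa")
mean_fitness = {
    g: sum(weights[env] * fitness[env][g] for env in fitness) for g in genotypes
}

for env, values in fitness.items():
    best = max(values, key=values.get)
    print(f"{env}: most fit genotype is {best}")
for g in genotypes:
    print(f"average fitness of {g}: {mean_fitness[g]:.2f}")
```

The heterozygote is never the most fit genotype within an environment, yet it has the highest fitness when averaged across both.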

In  some  cases,  the  relative  fitnesses  of  the  genotypes  vary  geographically 
across  a more  or  less  smooth  environmental  gradient,  for  example,  according 
to  latitude,  altitude,  aridity,  or  salinity.  If  sufficiently  stable  in  time,  a gradient 
of  selection  across  a region  can  result  in  a gradient  of  allele  frequency  across 
the  region.  A geographical  trend  in  an  allele  frequency  is  called  a cline.  An 
unusually  extreme  example  of  a cline  is  found  in  the  hemoglobin-11  allele  in 
the  eelpout  fish  Zoarces  viviparus,  the  allele  frequency  of  which  drops  from  a 
value  of  nearly  1 in  the  North  Sea  to  a value  of  nearly  0 in  the  Baltic  Sea 
(Christiansen  and  Frydenberg  1974).  In  human  aboriginal  populations,  there 
is a cline of increasing frequency of the allele $I^B$ in the ABO blood groups
from  Southwest  to  Northeast  Europe. 

Although clines can result from selection (for example, when one genotype
is favored at one extreme of the environmental gradient but disfavored
at the other extreme), clines can also result from other processes. Migration is
one  possibility:  differences  in  allele  frequency  in  local  populations  at  the 
extremes  of  the  range  may  result  from  chance  processes  (for  example,  differ- 
ent founding  populations),  and  migration  of  organisms  from  the  extremes 
into  the  intermediate  zone  produces  the  cline. 

The strongest evidence that a cline results from selection is when a cline is
reproduced in different locations along a similar environmental gradient. An
example of parallel clines played out on a grand scale is found in the
electrophoretic polymorphism of alcohol dehydrogenase (the Adh gene) in D.
melanogaster.  In  Eastern  North  America,  the  frequency  of  the  AdhF  allele 
increases  as  one  goes  north,  whereas  DNA  polymorphisms  flanking  Adh 
show  no  such  geographic  trend  (Berry  and  Kreitman  1993).  The  cline  is 
shown in the upper part of Figure 6.9. The frequency of AdhF is correlated
with  cooler  temperatures  and  less  rainfall  in  the  more  northern  latitudes.  In 
Australia,  as  shown  in  the  lower  part  of  Figure  6.9,  the  frequency  of  the  AdhF 
allele  increases  as  one  goes  south  (Oakeshott  et  al.  1982).  This  pattern  is  in 
apparent contradiction to that in Eastern North America but, because Australia
is in the Southern Hemisphere, the clines are actually parallel. Both
show  an  increase  in  the  frequency  of  AdhF  as  one  proceeds  from  the  equator 
toward  the  polar  cap — the  North  Pole  in  the  Northern  Hemisphere  and  the 
South  Pole  in  the  Southern  Hemisphere.  On  a much  smaller  geographical 
scale,  in  mountainous  regions,  the  frequency  of  the  AdhF  allele  shows  a clinal 
increase  with  altitude,  which  is  again  correlated  with  cooler  temperature  and 
less  rainfall.  Data  from  the  Caucasus  Mountains  (Grossman  et  al.  1970)  have 
been discussed in Problem 4.2; parallel clines have also been studied in the
mountains of Mexico (Pipkin et al. 1976).

scale near the extreme values of p: for values of p = 0.1, 0.5, and 0.9, the values of
$\arcsin(\sqrt{p})$ are 0.322, 0.785, and 1.249, respectively, where the angles are
measured in radians. The angular transformation is often used for proportions
because it separates the variance of an estimate from the estimate itself: for a
binomial proportion p based on n observations, the variance of p is p(1 - p)/n,
whereas the variance of $\arcsin(\sqrt{p})$, with the angle expressed in radians, is
approximately 1/(4n). (North American data from Berry and Kreitman 1993;
Australian data from Oakeshott et al. 1982.)
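
The numbers quoted for the angular transformation, and the claim that its variance is approximately 1/(4n) whatever the value of p, can be checked as follows. The sketch uses a first-order (delta-method) approximation for the variance with a numerically estimated slope; the function names and the choice n = 50 are illustrative assumptions.

```python
import math

def angular(p):
    """Angular (arcsine square-root) transformation, in radians."""
    return math.asin(math.sqrt(p))

def approx_variance_of_angular(p, n, eps=1e-6):
    """First-order variance: (slope of the transformation)^2 * p(1 - p)/n."""
    slope = (angular(p + eps) - angular(p - eps)) / (2 * eps)
    return slope ** 2 * p * (1 - p) / n

for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: arcsin(sqrt(p)) = {angular(p):.3f}")

n = 50
for p in (0.1, 0.3, 0.5):
    print(f"p = {p}: var(p) = {p * (1 - p) / n:.5f}, "
          f"var(angular) ~ {approx_variance_of_angular(p, n):.5f}, 1/(4n) = {1 / (4 * n):.5f}")
```

The variance of the untransformed proportion changes with p, whereas the approximate variance of the transformed value does not.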


Diversifying  Selection 

The  term  diversifying  selection  refers  narrowly  to  selection  that  favors 
extreme  phenotypes.  In  a normal  distribution  of  phenotypes,  for  example, 
diversifying  selection  means  that  organisms  in  the  tails  of  the  distribution  are 
favored  relative  to  those  in  the  middle.  More  generally,  diversifying  selection 
refers  to  any  type  of  selection  in  which  genotypes  are  favored  merely  because 
they are different. Genes under diversifying selection tend to maintain a
relatively large number of alleles. Examples include genes of the major
histocompatibility  complex  in  mammals,  in  which  the  selective  agent  is 
thought  to  be  through  resistance  to  parasitic  microorganisms  (Satta  et  al. 
1993)  and  bacterial  genes  that  produce  toxins  (colicins)  that  kill  other  bacte- 
ria, in  which  the  selective  agent  is  the  destruction  of  competitors  (Riley  1993; 
Ayala  et  al.,  1994). 

Some plants have genes for gametophytic self-incompatibility, in which
a pollen grain that carries any self-incompatibility allele is unable to pollinate
a plant that carries the same allele. Self-incompatibility of this type implies
that no plant can fertilize itself. Because a plant of genotype $S_iS_j$ can produce
only $S_i$ and $S_j$ pollen, the pollen cannot fertilize $S_iS_j$ plants. Furthermore,
homozygous genotypes are not normally found because their formation
would require that $S_i$ pollen fertilize an $S_iS_j$ plant. It is easy to show that there
is positive selection for new self-sterility alleles and that, at equilibrium,
every allele has the same frequency. For n alleles, if $S_i$ has frequency $p_i$, then
the frequency of $S_iS_j$ genotypes with random mating is $2p_i(1 - p_i)/(1 - \sum p_i^2)$.
The denominator is necessary because of the absence of homozygous
genotypes. The probability that an $S_i$ pollen can be successful in fertilization
is therefore the probability of genotypes other than $S_iS_j$, which equals
$1 - 2p_i(1 - p_i)/(1 - \sum p_i^2)$. At equilibrium, we must have $p_i(1 - p_i) = p_j(1 - p_j)$.
From these expressions follow some important conclusions summarized in
Problem 6.9. For more information on gametophytic self-incompatibility
systems, see Ioerger et al. (1991) and Uyenoyama (1995).


PROBLEM 6.9 Show that $p_i(1 - p_i) = p_j(1 - p_j)$ for all i and j implies that
$p_i = p_j = 1/n$, where n is the number of self-incompatibility alleles and
n ≥ 3. Use these equilibrium allele frequencies to show that the probability
that a pollen grain lands on a compatible style equals (n - 2)/n. Finally,
show that the probability of successful fertilization by a new mutant S
allele, relative to that of any preexisting allele, equals n/(n - 2).


ANSWER $p_i(1 - p_i) = p_j(1 - p_j)$ implies that $p_i - p_j = p_i^2 - p_j^2 = (p_i - p_j)(p_i + p_j)$,
so that either $p_i = p_j$ for all i and j or $p_i + p_j = 1$. Because n ≥ 3,
$(p_i + p_j) \neq 1$. Because there are n alleles, we must have $\sum p_i = 1$, and so
$p_i = 1/n$. The probability of a pollen grain landing on a compatible style
is $1 - 2p_i(1 - p_i)/(1 - \sum p_i^2) = 1 - 2/n = (n - 2)/n$. A pollen grain containing
a newly arising S allele will always land on a compatible style,
and so its probability of fertilization, relative to that of a preexisting
allele, equals 1/[(n - 2)/n] = n/(n - 2). In effect, this is the relative
fitness of a new mutation. For n = 3, 4, 5, 10, 50, and 100, it equals 3, 2,
1.67, 1.25, 1.04, and 1.02, respectively.
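
The values listed in the answer can be reproduced directly from the compatibility probability. The sketch below uses exact fractions to avoid rounding; it assumes equal allele frequencies of 1/n, as derived in the problem, and the function names are illustrative.

```python
from fractions import Fraction

def compatible_probability(freqs):
    """For each allele i, 1 - 2 p_i (1 - p_i) / (1 - sum of p_k squared)."""
    heterozygosity = 1 - sum(p * p for p in freqs)
    return [1 - 2 * p * (1 - p) / heterozygosity for p in freqs]

def new_allele_advantage(n):
    """Relative fertilization success of a new S allele when n alleles are at 1/n."""
    freqs = [Fraction(1, n)] * n
    return 1 / compatible_probability(freqs)[0]

for n in (3, 4, 5, 10, 50, 100):
    advantage = new_allele_advantage(n)
    print(f"n = {n:>3}: n/(n - 2) = {advantage} = {float(advantage):.2f}")
```

The advantage of a new allele shrinks toward 1 as the number of alleles already present grows, which limits but never removes the selection for additional alleles.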


Differential  Selection  in  the  Sexes 

Some  genes  may  have  different  effects  in  the  two  sexes.  If  the  fitnesses  of 
genotypes  differ  between  the  sexes,  then  genotypes  that  are  disfavored  in  one 
sex may be favored in the other. The offsetting effects increase the opportunity
for a balanced polymorphism. The survivorship model of selection can be
extended to include this case by supposing that the relative viabilities of the
genotypes AA, Aa, and aa are given by $w_{11}$, $w_{12}$, and $w_{22}$ in females and by $v_{11}$,
$v_{12}$, and $v_{22}$ in males. One of the w's and one of the v's can be set arbitrarily to
1, which leaves four fitness parameters rather than two. A more serious complication
is that the allele frequencies in gametes are no longer the same in
males and females. Letting $p_f$ and $p_m$ be the allele frequency of A in female
and male gametes, respectively, then the genotype frequencies of AA, Aa, and
aa in the zygotes are $p_f p_m$, $p_f q_m + q_f p_m$, and $q_f q_m$, respectively, where $q_f = 1 - p_f$
and $q_m = 1 - p_m$. One of the consequences of differential selection in the sexes
is  that,  with  an  appropriate  choice  of  fitnesses,  it  is  possible  to  have  more  than 
one  stable  polymorphic  equilibrium.  A stable  equilibrium  is  also  possible 
with  heterozygote  inferiority  in  one  sex  or  with  incomplete  dominance  when 
selection  works  in  opposite  directions  in  the  two  sexes. 
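
The two-sex recursion is short enough to write out. In the sketch below, zygotes are formed from female and male gametes with A frequencies pf and pm, and viability selection is then applied separately in each sex to obtain the gamete frequencies of the next generation. The viabilities are hypothetical: selection is additive and favors A in females and a in males. The function and variable names are assumptions made for illustration.

```python
def next_generation(pf, pm, w, v):
    """Return the A frequency in female and male gametes after one generation.

    pf, pm: frequency of A in female and male gametes
    w, v:   viabilities of (AA, Aa, aa) in females and in males
    """
    qf, qm = 1.0 - pf, 1.0 - pm
    zygotes = (pf * pm, pf * qm + qf * pm, qf * qm)   # AA, Aa, aa

    def gamete_frequency(viabilities):
        surv_AA, surv_Aa, surv_aa = (z * f for z, f in zip(zygotes, viabilities))
        return (surv_AA + 0.5 * surv_Aa) / (surv_AA + surv_Aa + surv_aa)

    return gamete_frequency(w), gamete_frequency(v)

w = (1.0, 0.75, 0.5)   # females: A favored (hypothetical values)
v = (0.5, 0.75, 1.0)   # males:   a favored (hypothetical values)

pf, pm = 0.01, 0.01    # A starts rare in both sexes
for _ in range(5000):
    pf, pm = next_generation(pf, pm, w, v)
print(f"after 5000 generations: pf = {pf:.3f}, pm = {pm:.3f}")
```

With these values a rare allele increases from either end of the frequency range, so both alleles persist even though neither sex shows heterozygote superiority.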

X-linked Genes

Genes  located  in  the  X chromosome  can  have  the  same  sort  of  complications 
as  differential  selection  in  the  sexes,  but  the  possibilities  for  polymorphism 
are  not  quite  so  numerous  because  there  are  only  three  fitness  parameters 
instead  of  four.  If  A and  a are  alleles  of  an  X-linked  gene,  then  there  are  three 
genotypes  in  females  (AA,  Aa,  and  aa)  and  two  genotypes  in  males  (either  A 
or  a along  with  the  Y chromosome).  One  fitness  parameter  in  each  sex  can  be 
set  arbitrarily  to  1.  As  with  differential  selection  in  the  sexes,  the  allele  fre- 
quencies differ  in  eggs  and  sperm.  However,  in  any  generation,  the  frequen- 
cy of  A in  male  zygotes  equals  the  frequency  of  A in  female  gametes  of  the 
preceding  generation.  If  you  do  not  understand  why,  think  about  the  parental 
origin  of  the  X chromosome  in  a male. 

Gametic  Selection 

Many plants go through a life cycle in which both the haploid products of meiosis
and the diploid products of fertilization are exposed to selection. In
mosses and vascular plants, for example, a diploid organism (the sporophyte)
produces spores each of which germinates to form a haploid organism (the
gametophyte) that reproduces asexually by mitosis. The gametophytes give
rise to haploid male and female gametes, which undergo fertilization, creating
a new diploid generation. In mosses, the prominent stage of the life cycle
is the gametophyte whereas, in higher plants, the prominent stage is the
sporophyte.

When the haploid phase of the life cycle is exposed to selection, the selection
is called gametic selection. As a concrete model, suppose that the relative
survivorships of A and a gametophytes (the haploid phase) are given by
$v_1$ and $v_2$, respectively. In the sporophytes (the diploid phase), the survivorships
can be written as before as $w_{11}$, $w_{12}$, and $w_{22}$. If p and q are the allele
frequencies of A and a at the beginning of the haploid phase, then after the
differential haploid mortality has taken place, the frequencies will be
$p^* = pv_1/\bar{v}$ and $q^* = qv_2/\bar{v}$, where $\bar{v} = pv_1 + qv_2$. With random fertilization among
the gametes, the diploid genotypes AA, Aa, and aa are formed in the proportions
$p^{*2}$, $2p^*q^*$, and $q^{*2}$, and these survive in the relative proportions $w_{11}$, $w_{12}$,
and $w_{22}$. You may verify for yourself that, at the beginning of the haploid
phase of the next generation, the allele frequency of A is

$$p' = \frac{p^2 w_{11} v_1^2 + pq\,w_{12} v_1 v_2}{p^2 w_{11} v_1^2 + 2pq\,w_{12} v_1 v_2 + q^2 w_{22} v_2^2}$$

This equation has the same form as the equation for p' in Equation 6.10
except that $w_{11}$ is replaced with $w_{11}v_1^2$, $w_{12}$ with $w_{12}v_1v_2$, and $w_{22}$ with $w_{22}v_2^2$.
The conditions for fixation or for a stable or unstable equilibrium are therefore
determined by the relative magnitude of the composite "fitness" of the
heterozygous genotype relative to those of the homozygous genotypes.
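
The verification suggested in the text can also be done numerically. The sketch below computes the next-generation frequency in two ways: by applying haploid and then diploid selection step by step, and by using the ordinary viability recursion with composite fitnesses w11·v1², w12·v1·v2, and w22·v2². The survivorship values and function names are hypothetical.

```python
def two_stage(p, v, w):
    """Haploid selection (v1, v2) followed by diploid selection (w11, w12, w22)."""
    v1, v2 = v
    w11, w12, w22 = w
    q = 1.0 - p
    v_bar = p * v1 + q * v2
    p_star, q_star = p * v1 / v_bar, q * v2 / v_bar   # after haploid mortality
    w_bar = p_star**2 * w11 + 2 * p_star * q_star * w12 + q_star**2 * w22
    return (p_star**2 * w11 + p_star * q_star * w12) / w_bar

def composite(p, v, w):
    """Ordinary viability recursion using the composite fitnesses."""
    v1, v2 = v
    w11, w12, w22 = w
    c11, c12, c22 = w11 * v1 * v1, w12 * v1 * v2, w22 * v2 * v2
    q = 1.0 - p
    return (p * p * c11 + p * q * c12) / (p * p * c11 + 2 * p * q * c12 + q * q * c22)

v = (1.0, 0.8)        # hypothetical gametophyte survivorships of A and a
w = (0.7, 1.0, 0.9)   # hypothetical sporophyte survivorships of AA, Aa, aa

for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: two-stage p' = {two_stage(p, v, w):.6f}, "
          f"composite p' = {composite(p, v, w):.6f}")
```

The two calculations agree at every frequency, which is why the familiar conditions for fixation or polymorphism can be applied to the composite fitnesses.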

Meiotic  Drive 

A situation  analogous  to,  but  distinct  from,  gametic  selection  takes  place 
when  there  is  non-Mendelian  segregation  in  the  heterozygous  genotype.  In 
females,  unequal  recovery  of  reciprocal  products  of  meiosis  can  be  caused  by 
nonrandom  segregation  of  homologous  chromosomes  to  the  functional  egg 
nucleus,  which  is  why  non-Mendelian  segregation  is  known  generically  as 
meiotic  drive.  In  other  cases,  the  unequal  recovery  is  caused  by  a gene  or 
genes  that  act  to  render  gametes  carrying  the  homologous  chromosome  non- 
functional. Examples  include  "sperm  killers"  such  as  segregation  distortion  in 
Drosophila  melanogaster  (Charlesworth  and  Hartl  1978)  and  the  t alleles  in  the 
house  mouse  (Hammer  and  Silver  1993)  as  well  as  "spore  killers"  described 
in  filamentous  fungi  (Raju  1994). 

Because meiotic drive acts only in the heterozygous genotype, its effect is
to alter the term $pqw_{12}$ in Equation 6.10 for p'. This term comes from the
expression $\tfrac{1}{2} \times 2pqw_{12}$ for the proportion of A-bearing gametes from surviving
Aa genotypes, and the 1/2 is the Mendelian segregation ratio. If the ratio of
A : a gametes from Aa heterozygotes is k : 1 - k instead of 1/2 : 1/2, then the
expression for p' becomes

$$p' = \frac{p^2 w_{11} + 2kpq\,w_{12}}{\bar{w}} \tag{6.23}$$

where $\bar{w}$ is the average survivorship in the population defined in Equation
6.8. Since A is the driven allele, k > 1/2. Equation 6.23 is illustrative of meiotic
drive even though it requires that the non-Mendelian segregation affect both
sexes equally, a case that is not generally found in practice. One implication
of the equation is that, unless selection counteracts the meiotic drive, the driven
allele goes to fixation. In particular, if the relative viabilities are equal,
then $p' = p^2 + 2kpq$ and $\Delta p = pq(2k - 1)$, so that p → 1 because k > 1/2.

In some examples of meiotic drive, including segregation distortion and the
t alleles, the driven allele is lethal when homozygous (Hartl 1970). Assuming
that the lethality is completely recessive, the survivorships are $w_{11} = 0$,
$w_{12} = 1$, and $w_{22} = 1$. Equation 6.23 implies that $p' = 2kp/(1 + p)$ and so
$\Delta p = p[(2k - 1) - p]/(1 + p)$. There is an interior equilibrium at $\hat{p} = 2k - 1$, which
intuition suggests (correctly) is locally stable. It is also globally stable (Figure
6.10). Note that $\hat{p}$ is between 0 and 1 for any value of k between 1/2 and 1. The
calculations for a recessive-lethal driven allele are a special case of the slightly
more general model discussed in Problem 6.10.
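
Both conclusions about meiotic drive can be checked by iterating the recursion. The sketch assumes Equation 6.23 in the form p' = (p²w11 + 2k·pq·w12)/w̄, with w̄ the usual average survivorship; the values of k, the starting frequency, and the function names are illustrative choices.

```python
def next_p(p, k, w11, w12, w22):
    """One generation with viability selection and segregation ratio k in Aa."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return (p * p * w11 + 2 * k * p * q * w12) / w_bar

def iterate(p0, k, w11, w12, w22, generations=500):
    p = p0
    for _ in range(generations):
        p = next_p(p, k, w11, w12, w22)
    return p

# Equal viabilities: drive alone should carry A to fixation.
p_equal = iterate(0.01, k=0.6, w11=1.0, w12=1.0, w22=1.0)

# Driven allele is a recessive lethal: expected equilibrium at 2k - 1.
k_lethal = 0.9
p_lethal = iterate(0.01, k=k_lethal, w11=0.0, w12=1.0, w22=1.0)

print(f"equal viabilities, k = 0.6: p -> {p_equal:.6f}")
print(f"recessive lethal, k = {k_lethal}: p -> {p_lethal:.6f} (2k - 1 = {2 * k_lethal - 1:.6f})")
```

Starting from other frequencies gives the same end points, in line with the statement that the interior equilibrium is globally stable.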


PROBLEM  6. 1 0 Suppose  that  the  AA  genotype  is  not  completely 
lethal  but  that  its  survivorship  is  given  by  1 - s relative  to  a value  of  1 
for  both  Aa  and  aa  genotypes.  Show  that  A p = pq[(2k  - 1)  - ps]/ 
(1  - p2s).  Find  p and  define  the  conditions,  in  terms  of  k and  s,  for 
which  p is  between  0 and  1.  Show  also  that  the  equilibrium  is  locally 
stable. 


ANSWER Equation 6.23 implies that $p' = [p^2(1 - s) + 2kpq]/(1 - p^2 s)$, and
$\Delta p = p' - p$ simplifies to the formula given. Setting $\Delta p = 0$ yields
equilibria at 0, 1, and $\hat{p} = (2k - 1)/s$. For $\hat{p} > 0$, we need $(2k - 1)/s > 0$,
or $k > 1/2$. For $\hat{p} < 1$, we need $(2k - 1)/s < 1$, or $k < (s + 1)/2$. Note that, as
the selection against the A allele becomes smaller ($s$ closer to 0), more
values of $k$ result in fixation of the unfavorable A allele and fewer
result in an interior equilibrium. The stability of $\hat{p}$ can be deduced by
evaluating the derivative in Equation 6.19. For this purpose, it is
convenient to write $\Delta p$ as $pqs(\hat{p} - p)/(1 - p^2 s)$. In taking the
derivative, remember that any term containing $\hat{p} - p$ becomes 0 when $p = \hat{p}$,
so these terms can be neglected. The derivative, evaluated at $\hat{p}$, equals
$-\hat{p}\hat{q}s/(1 - \hat{p}^2 s)$, where $\hat{q} = 1 - \hat{p}$. The sign of this number must be
negative, and so the equilibrium at $\hat{p}$, when it exists, is locally stable.


Figure 6.10 The balance between meiotic drive and viability selection. (A) Viability
only: $\Delta p$ versus $p$ for viability alone, when the fitnesses are $w_{11} = w_{12} = 1$ and
$w_{22} = 0.6$. With these fitnesses, viability selection would eliminate the a allele.
(B) Meiotic drive only: the heterozygous genotype Aa produces 40% A-bearing
gametes and 60% a-bearing gametes. With meiotic drive alone, the A allele
would be lost. (C) Viability and meiotic drive: $\Delta p$ versus $p$ when both viability
selection and meiotic drive are operating at the same time, using the same fitness
and meiotic drive parameters as above. In this example, when both processes
operate simultaneously, their offsetting effects create a stable polymorphism.
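
The equilibria discussed for meiotic drive can be checked numerically. The following Python sketch iterates the recursion for one generation of viability selection plus drive. Equation 6.23 itself is not reproduced in this section, so the general form used here, $p' = (w_{11}p^2 + 2k\,w_{12}\,pq)/\bar{w}$, is an assumption inferred from the special cases quoted above (it reduces to $p' = 2kp/(1 + p)$ for a recessive lethal). The function names are illustrative only.

```python
def next_p(p, k, w11, w12, w22):
    """One generation of viability selection plus meiotic drive.

    p is the frequency of the driven allele A; k is the fraction of
    A-bearing gametes produced by Aa heterozygotes (k = 1/2 is Mendelian).
    Assumed form of the recursion; see the lead-in paragraph.
    """
    q = 1.0 - p
    mean_w = w11 * p * p + w12 * 2 * p * q + w22 * q * q
    return (w11 * p * p + 2 * k * w12 * p * q) / mean_w


def iterate(p0, k, w11, w12, w22, generations=2000):
    """Apply the recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_p(p, k, w11, w12, w22)
    return p


def interior_equilibrium(k, s):
    """p-hat = (2k - 1)/s from Problem 6.10, or None if it lies outside (0, 1)."""
    p_hat = (2 * k - 1) / s
    return p_hat if 0 < p_hat < 1 else None


if __name__ == "__main__":
    # Recessive lethal driven allele, k = 0.75: expected p-hat = 2k - 1 = 0.5.
    print(iterate(0.01, 0.75, 0.0, 1.0, 1.0))
    # Problem 6.10 with s = 0.5 and k = 0.7: expected p-hat = (2k - 1)/s = 0.8.
    print(iterate(0.01, 0.7, 0.5, 1.0, 1.0), interior_equilibrium(0.7, 0.5))
    # Parameters of Figure 6.10(C): both starting points approach the same value.
    print(iterate(0.01, 0.4, 1.0, 1.0, 0.6), iterate(0.99, 0.4, 1.0, 1.0, 0.6))
```

With $k = 0.9$ and $s = 0.5$ the condition $k < (s + 1)/2$ fails, `interior_equilibrium` returns `None`, and the iteration runs to fixation of A, in line with the answer to Problem 6.10.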


Multiple  Alleles 

The presence of multiple alleles complicates the analysis of selection
because the number of fitness parameters increases. With $n$ alleles, there are
$n(n + 1)/2$ possible genotypes, each with its own fitness. Furthermore,
simple generalizations from two-allele theory do not necessarily carry over to
multiple alleles. Consider the example of heterozygote superiority.
Intuitively, one might expect that fitnesses yielding stable, multiple-allele
polymorphisms would be easy to generate by requiring that each heterozygous
genotype have a greater fitness than the homozygous genotypes formed
from the constituent alleles. This is not the case, however. If, for $n$ alleles,
the fitnesses of the genotypes are assigned at random between 0 and 1,
subject to the condition that, for each $i$ and $j$, $w_{ij} > \max(w_{ii}, w_{jj})$, then only a
relatively small proportion of systems with four or more alleles yields a stable
polymorphism with all alleles present. For four, five, and six alleles, the
percentage of fitness sets yielding a stable equilibrium is 12.6, 1.2, and 0.03,
respectively (Lewontin et al. 1978). The reason for the low percentages is
that,  even  if  a heterozygote  is  more  fit  than  its  constituent  homozygotes, 
there  might  be  a different  homozygote  more  fit  than  all  three.  All  right,  how 
about  requiring  that  each  heterozygote  be  better  than  every  homozygote? 
Surprisingly,  this  requirement  does  not  help  matters  much.  In  this  case,  for 
four,  five,  and  six  alleles,  the  percentage  of  fitness  sets  yielding  a stable 
equilibrium  is  34.3,  10.4,  and  1.3,  respectively  (Lewontin  et  al.  1978).  The 
point  is  that  polymorphisms  with  greater  than  three  or  four  alleles  are 
extremely  unlikely  to  be  maintained  by  selection  for  simple  heterozygous 
advantage  with  constant  survivorship.  If  selection  is  implicated  in  such  a 
case,  models  of  selection  such  as  diversifying  selection  or  heterogeneous 
environments  are  much  more  plausible.  On  the  other  hand,  the  fitnesses  of 
genotypes  in  nature  are  not  chosen  simultaneously  by  a random  number 
generator.  Each  new  allele  that  arises  is  tested  against  the  resident  alleles, 
and  the  new  allele  is  able  to  invade  the  population  if  its  marginal  fitness 
exceeds  the  mean  fitness  of  the  population.  By  this  process,  multiple  allele 
polymorphisms  can  be  accumulated,  and  the  order  in  which  the  mutations 
appear  makes  a difference  (Spencer  and  Marks  1988). 
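
The kind of numerical experiment described above can be sketched in Python. The sketch relies on two standard properties of the one-locus, constant-viability model: at an equilibrium with all alleles present, every allele has the same marginal fitness (so the frequencies solve $W\hat{p} \propto \mathbf{1}$), and that equilibrium is stable when the mean fitness $\bar{w} = p^{T}Wp$ has a strict local maximum there. The sampling scheme (uniform fitnesses, rejecting sets that violate $w_{ij} > \max(w_{ii}, w_{jj})$) and all function names are assumptions made for illustration; this is not the procedure of Lewontin et al. (1978) and is not claimed to reproduce their percentages.

```python
import math
import random


def solve(A, b):
    """Gaussian elimination with partial pivoting; returns None if A is singular."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def is_positive_definite(S):
    """Cholesky test: succeeds only if the symmetric matrix S is positive definite."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def stable_full_polymorphism(W):
    """True if symmetric fitness matrix W has a stable equilibrium with all alleles present."""
    n = len(W)
    x = solve(W, [1.0] * n)          # equal marginal fitnesses: W p proportional to 1
    if x is None or sum(x) == 0:
        return False
    total = sum(x)
    p = [v / total for v in x]
    if min(p) <= 0:                  # equilibrium is not interior
        return False
    last = n - 1
    # Negative of the quadratic form of W on directions e_i - e_n (frequency changes sum to 0).
    neg_h = [[-(W[i][j] - W[i][last] - W[last][j] + W[last][last])
              for j in range(last)] for i in range(last)]
    return is_positive_definite(neg_h)


def random_heterotic_fitnesses(n, rng):
    """Rejection-sample uniform fitnesses with every w_ij > max(w_ii, w_jj)."""
    while True:
        W = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i, n):
                W[i][j] = W[j][i] = rng.random()
        if all(W[i][j] > max(W[i][i], W[j][j])
               for i in range(n) for j in range(i + 1, n)):
            return W


def fraction_stable(n, trials, seed=0):
    """Fraction of sampled fitness sets that maintain all n alleles."""
    rng = random.Random(seed)
    hits = sum(stable_full_polymorphism(random_heterotic_fitnesses(n, rng))
               for _ in range(trials))
    return hits / trials
```

For example, `stable_full_polymorphism([[0.5, 1, 1], [1, 0.5, 1], [1, 1, 0.5]])` is true for symmetric overdominance, whereas the set `[[0.5, 0.51, 0.51], [0.51, 0.5, 0.9], [0.51, 0.9, 0.5]]` satisfies the pairwise condition yet has no equilibrium with all three alleles present. Rejection sampling becomes slow as $n$ grows, so `fraction_stable` is practical only for small $n$.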

The possibility of multiple alleles also creates surprising situations in
which the outcome of natural selection depends on the order in which the
alleles are introduced into the population. Earlier in this chapter we
mentioned the sickle-cell hemoglobin polymorphism in Africa and its relation to
malaria resistance. People who are homozygous AA for the normal allele are
susceptible to falciparum malaria, those who are heterozygous AS for the
sickle-cell allele are resistant to malaria and have a mild anemia, and those
who are homozygous SS for the sickle-cell allele have a life-threatening
anemia. This is a classic case of heterozygote superiority. There is another allele,
C, found at low frequency in populations in which the S allele is prevalent.
The C allele is also protective against malaria, but the allele is recessive, and
so only the CC genotypes are resistant. Unlike the S allele, the C allele does
not cause anemia.

The relative survivorship of each of the various hemoglobin genotypes
has been estimated based on studies of more than 32,000 people in 72
populations in West Africa (Cavalli-Sforza and Bodmer 1971). The survivorships
are given in the following table, which indicates the genotypes that are
resistant and those that have severe hemolytic anemia. The survivorships were
estimated in a geographical region where malaria was common. Note that
the S allele causes a severe anemia in the heterozygous SC genotype, but not
so serious as that in the homozygous SS genotype.

Genotype    Survivorship    Health status
AA          0.9
AS          1.0             Resistant
SS          0.2             Anemic
AC          0.9
SC          0.7             Anemic
CC          1.3             Resistant

Inspection of these survivorships reveals a paradox. The CC genotype has
the highest fitness, yet the C allele is not fixed. The reason is found in the
historical order in which the S and C mutations took place. The A allele is
the ancestral type and undoubtedly predated the human settlement of
regions subject to malaria. In such a region, the appearance of an S allele
creates a heterozygous advantage, and natural selection quickly attains a stable
equilibrium at which the ratio of A : S alleles is approximately 8 : 1. At this
equilibrium, the average fitness in the population is $\bar{w} = 0.911$. Now suppose
that mutation or migration were to introduce a small number of C alleles.
Because C alleles are rare, each is present in either the AC genotype, with
probability 8/9, or in the SC genotype, with probability 1/9. The average fitness
of genotypes heterozygous for C is therefore (8/9)(0.9) + (1/9)(0.7) = 0.878,
which is smaller than the average fitness in the population. Hence, the
frequency of C decreases, and C goes extinct. The C allele has no chance of
invading an A/S polymorphism unless the initial frequency of C is sufficiently
large. Figure 6.11 illustrates this phenomenon. With the survivorships given in
this example, the critical initial frequency of C that allows invasion is 0.073.
Once C can get established in the population, it eventually becomes fixed.

Figure 6.11 Change in frequency of the hemoglobin C allele in a population
in which the A and S alleles are present in their equilibrium proportions of 8 : 1.
When the initial frequency of C is small, the change in frequency is negative,
and so C is eliminated even though CC genotypes have the highest fitness. The
C allele is unable to invade unless its initial frequency is greater than 0.073, and
in that case C goes to fixation. The plot is based on the survivorship values
given in the text.
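
The numbers quoted for the hemoglobin example can be recomputed from the survivorships in the text. The Python sketch below uses the standard multiple-allele recursion $p_i' = p_i w_i/\bar{w}$, where $w_i$ is the marginal fitness of allele $i$. It assumes, as the caption of Figure 6.11 suggests, that C is introduced at frequency $x$ while A and S keep their 8 : 1 ratio among the remaining alleles; the function names are illustrative only.

```python
# Survivorships from the text; allele order is (A, S, C).
W = [[0.9, 1.0, 0.9],
     [1.0, 0.2, 0.7],
     [0.9, 0.7, 1.3]]


def marginal_fitnesses(p):
    """Marginal fitness of each allele: w_i = sum over j of w_ij * p_j."""
    return [sum(W[i][j] * p[j] for j in range(3)) for i in range(3)]


def mean_fitness(p):
    """Average fitness in the population: sum over i of p_i * w_i."""
    return sum(pi * wi for pi, wi in zip(p, marginal_fitnesses(p)))


def delta_c(x, ratio_a_to_s=8.0):
    """One-generation change in the frequency of C, introduced at frequency x,
    with A and S in the stated ratio among the remaining alleles (an assumption)."""
    share_a = ratio_a_to_s / (ratio_a_to_s + 1.0)
    p = [(1 - x) * share_a, (1 - x) * (1 - share_a), x]
    return x * marginal_fitnesses(p)[2] / mean_fitness(p) - x


def critical_frequency(lo=1e-6, hi=0.5, steps=60):
    """Bisection for the frequency at which delta_c changes sign."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if delta_c(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    p_eq = [8 / 9, 1 / 9, 0.0]                    # A/S equilibrium, no C present
    print(round(mean_fitness(p_eq), 3))           # average fitness at the A/S equilibrium
    print(round(marginal_fitnesses(p_eq)[2], 3))  # average fitness of a rare C allele
    print(round(critical_frequency(), 3))         # threshold frequency of C
```

Under these assumptions the three printed values agree with the 0.911, 0.878, and 0.073 given in the text. The threshold is the point where the one-generation change in C switches sign; the sketch does not follow the later trajectory, during which the A : S ratio itself changes.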

Multiple  Loci  and  Gene  Interaction:  Epistasis 

With multiple loci, as many types of gametes are possible as there are
combinations of alleles. The simplest example is the two-locus, two-allele case, in
which the possible gametes are AB, Ab, aB, and ab. In the absence of
recombination ($r = 0$), each type of gamete can be regarded as an "allele" of one locus
with four alleles. The principles of multiple-allele selection then apply, and
some of the "alleles" may be eliminated by selection. The presence of
recombination complicates matters because each gametic type is continually
recreated by recombination even if it is disfavored by selection. The influence of
recombination on the outcome of selection is determined by the recombination
fraction and by the degree of interaction between the loci. When selection
acts on the phenotype produced by the joint effects of multiple loci, there
are two general situations:

• Changes in allele frequency are driven primarily by the selection
coefficients and recombination plays a minor role.

• Selection and recombination are about equally important in determining
the outcome.

The  former  is  usually  the  case  with  weak  epistasis  and  moderate  or  loose 
linkage;  the  latter  is  more  prevalent  with  strong  epistasis  and  tight  linkage. 
The  term  epistasis  is  often  used  in  population  genetics  as  a synonym  for 
gene  interaction;  it  applies  to  any  situation  in  which  the  genetic  effects  of 
different loci that contribute to a phenotypic trait are not additive. In the
two-locus, two-allele example, the fitnesses (survivorships) of the genotypes can
be written as shown in Table 6.3, where it is assumed that the two types of
double heterozygote (AB/ab and Ab/aB) have the same fitness; for
convenience, this value is often set at $w_{22} = 1$. For each single-locus genotype, the
average survivorship is equal to the weighted average across each genotype
at the other locus. In Table 6.3, these averages are denoted $w_{AA}$, $w_{Aa}$, and so
on. Additivity across loci means that $w_{11} = w_{AA} + w_{BB}$, $w_{12} = w_{AA} + w_{Bb}$, and so
forth for all genotypes, including $w_{22} = w_{Aa} + w_{Bb} = 1$. If additivity does not
apply across all nine genotypes, then epistasis is said to be present. A
discussion of epistasis from a statistical point of view appears in Chapter 9.

TABLE 6.3 TWO-LOCUS FITNESSES (SURVIVORSHIPS)
Note: The table assumes that the two types of double heterozygotes, AB/ab and Ab/aB, have
the same fitness, $w_{22}$.
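
One simple way to test a set of two-locus fitnesses for additivity is sketched below in Python. It uses the fact that a table of values can be written as a row effect plus a column effect exactly when every 2 x 2 interaction contrast $w_{ij} - w_{il} - w_{kj} + w_{kl}$ is zero. This frequency-independent criterion is an assumption substituted here for the marginal-average definition in the text, and the example matrices and function names are hypothetical.

```python
def epistasis_contrasts(w):
    """All 2x2 interaction contrasts w[i][j] - w[i][l] - w[k][j] + w[k][l]."""
    rows, cols = len(w), len(w[0])
    return [w[i][j] - w[i][l] - w[k][j] + w[k][l]
            for i in range(rows) for k in range(i + 1, rows)
            for j in range(cols) for l in range(j + 1, cols)]


def is_additive(w, tol=1e-9):
    """True if every interaction contrast vanishes (no epistasis on this scale)."""
    return all(abs(c) <= tol for c in epistasis_contrasts(w))


if __name__ == "__main__":
    # Hypothetical additive fitnesses: row effects (A locus) plus column effects (B locus).
    a_effects = [0.3, 0.5, 0.4]
    b_effects = [0.6, 0.5, 0.2]
    additive = [[a + b for b in b_effects] for a in a_effects]
    print(is_additive(additive))

    # Hypothetical fitnesses in which the double heterozygote is favored beyond
    # what the single-locus effects would predict.
    interacting = [[0.9, 0.6, 0.9],
                   [0.8, 1.0, 0.8],
                   [0.9, 0.6, 0.9]]
    print(is_additive(interacting), max(epistasis_contrasts(interacting)))
```

The first matrix is additive by construction; the second is not, because the double heterozygote exceeds the sum of the single-locus effects.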

When there is strong epistasis and tight linkage, complications abound.
With two loci and two alleles at each, there are as many as 15 equilibria. Most
of them are unstable, but examples are known in which four interior
equilibria are simultaneously stable. Figure 6.12 is one example that shows the
average fitness in the population $\bar{w}$ as a function of the allele frequencies of A and
B. At any point in time, the gametic frequencies in the population are
determined not only by the allele frequencies of A and B but also by the linkage
disequilibrium parameter, which was denoted by the symbol D in the section
on linkage disequilibrium in Chapter 3. In this example, all 15 equilibria are
realized. There are four corner equilibria in which one gametic type is fixed,
namely, AB, Ab, aB, or ab; there are also four edge equilibria in which one
allele of either locus is fixed, namely, A, a, B, or b. With the survivorships as in
Figure 6.12, all of the corner and edge equilibria are unstable. There are also
three unstable interior equilibria, each of which has $p_A = p_B = 1/2$ and so is
located at the position of the open circle on the saddle in Figure 6.12; these
equilibria have the same allele frequencies but differ in the degree of linkage
disequilibrium. The positions of the stable equilibria are indicated by the
solid circles, each of which represents two equilibrium points with the same
allele frequencies but differing in their value of D. In this case, the equilibria
are symmetrical.


Figure 6.12 An example of two-locus, two-allele survivorship selection in which
there are four stable interior equilibria, the positions of which are indicated by the
dots near two of the corners. Each dot represents two stable points differing in the
sign of the linkage disequilibrium. This example also includes three unstable
interior equilibria (represented by the open circled point in the center), four unstable
edge equilibria, and four unstable corner equilibria. The survivorship parameters,
in the notation of Table 6.3, are $w_{11} = w_{13} = w_{31} = w_{33} = 0.9$, $w_{21} = w_{23} = 0.8$, and
$w_{12} = w_{32} = 0.6$; the recombination fraction is $r = 0.09$. (Example from Hastings 1985.)

Figure 6.12 is a plot of $\bar{w}$ to emphasize that the average fitness in the
population is not necessarily a maximum at equilibrium. In this example, none of
the four stable equilibria is a point of maximum average fitness. The
maximum fitness is found at any of the four corners, and these equilibria are
unstable. Furthermore, in the vicinity of each stable equilibrium, as the
population moves toward the equilibrium from certain directions, the average
fitness must decrease as the equilibrium is approached. Hence, not only is
average fitness not necessarily a maximum at equilibrium, but natural selection
can also cause a decrease in average fitness.

In models in which fitness depends on multiple interacting loci, do we
really have to give up the attractive generalization from one-locus theory that
selection acts in such a way as to increase average fitness? Not altogether.
Although even the two-locus, two-allele model of survivorship selection is
beyond present techniques of mathematical analysis, an important
generalization has come from approximate solutions as well as from computer
simulations (Ewens 1979): if epistasis is not too strong, and linkage is not too
tight, then the average fitness in the population usually increases.

This  statement  is  multiply  qualified  ("not  too  strong  . . . not  too  tight . . . 
usually  increases")  because  exceptions  can  rather  easily  be  constructed. 
However, the generalization is observed in most generations in most numerical
examples when the survivorships are chosen at random (Karlin and
Carmelli  1975).  To  the  extent  that  it  is  true,  the  generalization  supports  the 
powerful  metaphor  that  natural  selection  tends  to  increase  average  fitness.  If 
one  can  imagine  a complex  surface  of  hills  and  valleys  corresponding  to 
regions  of  high  and  low  average  fitness,  then  one  can  speak  metaphorically  of 
a population  as  a sort  of  "hill  climber"  moving  across  this  surface  and  scaling 
a fitness  "peak."  This  picturesque  analogy  is  a central  concept  in  Wright's 
shifting  balance  theory  of  evolution,  which  is  discussed  in  the  last  section  of 
this chapter. However, there are enough exceptions to the hill-climbing
generalization that maximization of average fitness cannot be used as a guide to
predicting  the  outcome  of  any  particular  set  of  fitness  values.  Each  model 
must  be  considered  in  detail  on  its  own. 

Sexual  Selection 

It  seems  that,  wherever  you  look  in  nature,  animals  have  physical  adornments 
or  behavioral  displays  to  help  them  in  obtaining  mates.  In  some  cases,  there 
is  direct  competition  between  animals,  usually  males,  as  exemplified  by  the 
contests  of  antler  bashing  in  moose  or  head  butting  in  bighorn  sheep.  In  other 
cases,  there  is  indirect  competition,  as  seen  in  the  behavioral  displays  of  male 
peacocks  in  full  plumage  strutting  their  stuff.  These  are  dangerous  activities. 
A bighorn  sheep  can  get  his  skull  fractured  or  fall  off  a cliff.  The  male  peacock 
is  conspicuous,  burdened,  and  preoccupied — vulnerable  to  any  predator. 

Darwin  (1871)  was  the  first  to  draw  attention  to  competition  for  mates  as  a 
source  of  selection  not  necessarily  related  to  adaptation  of  the  organism  to  its 
environment.  This  type  of  selection  he  called  sexual  selection.  In  the  case  of 
direct  competition  for  mates,  it  is  easy  to  understand  that  a successful  male 
leaves  more  progeny  than  an  unsuccessful  male,  and  so  alleles  promoting  the 
physical  adornments,  strength,  and  aggressiveness  needed  for  successful 
competition  for  mates  are  perpetuated  even  though  they  may  occasionally  be 
detrimental.  The  example  of  indirect  competition  is  considerably  more  subtle 
because  the  male  is  merely  advertising.  The  female  does  the  choosing.  One 
theory  for  the  evolution  of  male  sexual  displays  is  that,  in  the  early  stages  of 
their  evolution,  the  displays  take  advantage  of  a female  preference.  The  origin 
of the initial preference is unclear. Darwin suggested that female choosiness
and  offspring  number  are  both  associated  with  superior  nutrition,  hence 
choosy  females  may,  at  the  beginning,  have  had  more  offspring.  Whatever  the 
cause,  given  an  initial  choosiness  among  females,  males  with  more  effective 
displays  are  chosen  preferentially  as  mates,  and  their  offspring  receive  alleles 
that  create  both  the  displays  in  the  males  and  the  preferences  in  the  females.  If 
these traits are genetically correlated (as, for example, through common
hormonal or neurological pathways or through linkage disequilibrium), then
selection  becomes  a self-accelerating  process  promoting  increasingly  elaborate 
displays  and  increasingly  greater  choosiness.  According  to  Fisher  (1930): 

The two characteristics affected by such a process, namely plumage
development in the male, and sexual preference for such developments in the female,
must thus advance together, and so long as the process is unchecked by severe
counterselection, will advance with ever-increasing speed. In the total absence
of such checks, it is easy to see that the speed of development will be
proportional to the development already attained. There is thus, in any situation in
which sexual selection is capable of conferring a great reproductive
advantage, the potentiality of a runaway process which, however small the
beginnings from which it arose, must, unless checked, produce great effects,
and in the later stages with great rapidity.

The  ever-accelerating  process  is  called  runaway  sexual  selection,  and  the 
conditions  under  which  it  takes  place  have  been  studied  theoretically  (Lande 
and  Arnold  1985;  Kirkpatrick  and  Barton  1995;  Iwasa  and  Pomiankowski  1995). 

KIN  SELECTION 

One alternative type of selection, called kin selection, makes use of an
extended concept of "fitness." In kin selection, a positive selection for
certain alleles takes place indirectly through enhanced reproduction of the
genetic relatives of carriers of the alleles rather than directly through an
increased fitness of the carriers themselves. Kin selection has been
postulated in attempts to account for the evolution of altruism. A behavior is
regarded as altruism if it increases the fitness of other organisms at the expense of
one's own fitness. Altruistic behavior is exhibited most dramatically by
social insects such as termites, ants, and bees, in which certain worker castes
exert their labors for the care, protection, and reproduction of the queen and
her offspring but do not reproduce themselves. Other, less dramatic
examples of altruistic behavior include phenomena such as the care of offspring
by their parents.

A central consideration in kin selection is that relatives have genes in
common. Therefore, a gene that causes altruistic behavior can increase in
frequency if the increase in the recipient's fitness as a result of altruism is
sufficiently large to offset the decrease in the altruist's own fitness. The
essentials of the situation can be made clear by considering the case of identical
twins. Because identical twins are genetically identical, the reproduction of
one's twin is genetically equivalent to reproduction by oneself. Thus, it
makes no difference if an altruistic organism decreases its own fitness for the
sake of an equal increase in fitness of an identical twin; from an evolutionary
point of view, it is an even trade because the combined number of offspring
from both twins remains unchanged. By the same token, if an altruistic act
decreases the fitness of an organism by an amount less than the increase
gained by an identical twin, then the altruism results in a net increase in the
combined number of offspring. One would, therefore, expect altruism
between identical twins to be favored by natural selection as long as the risk
to the altruist is no greater than the benefit to the recipient.

These  considerations  of  identical  twins  can  be  extended  to  other  degrees 
of  relationship  as  well,  but  the  risk  to  the  altruist  must  be  correspondingly 
smaller  than  the  benefit  to  the  recipient  because  other  types  of  relatives  share 
fewer  genes  than  identical  twins.  The  break-even  points  for  altruism  toward 
various  degrees  of  relationship  have  been  trenchantly  summarized  by  J.  B.  S. 
Haldane,  who  is  said  to  have  quipped  that  he  would  lay  down  his  life  for  two 
brothers,  four  nephews,  or  eight  cousins.  In  any  case,  fitness  considerations 
that  take  into  account  not  only  an  organism's  own  fitness  but  also  the  fitness 
of  relatives  (other  than  direct  descendants)  constitute  what  is  called  the 
inclusive  fitness  of  the  organism. 

To be concrete, suppose that altruism results in a decrease in fitness $c$ of
the altruist that is offset by an increase in fitness $b$ in the recipient. The gene
for altruism increases in frequency if the ratio of cost to benefit is small
enough, relative to the genetic relationship between the altruist and the
recipient; that is, the gene for altruism increases in frequency if

$$\frac{c}{b} < r \tag{6.24}$$

as shown first by Hamilton (1964) and discussed in detail by Cavalli-Sforza
and Feldman (1978) and Uyenoyama and Feldman (1980). In this context, $r$ is
a measure of genetic relationship between the altruist X and the recipient of
the altruism Y, defined as


$$r = \frac{2F_{XY}}{1 + F_X} \tag{6.25}$$


where $F_X$ is the inbreeding coefficient of the altruist X, and $F_{XY}$ is the
inbreeding coefficient of a hypothetical offspring of X and Y. As illustrated in Figure
6.13, $r$ equals the probability that two gametes from X and Y contain alleles
that are identical by descent, $F_{XY}$, relative to the probability that two gametes
from X contain alleles that are identical by descent, $(1 + F_X)/2$. The
cost-benefit tradeoff in Equation 6.24 is generally valid for weak selection when
$F_X = 0$ and valid for additive alleles even when $F_X \neq 0$ (Aoki 1981).

Figure 6.13 Definition of the genetic relationship between an altruist X and
the recipient of the altruism Y. (A) Two alleles chosen at random from an
organism X are identical by descent with probability $(1 + F_X)/2$ (see Figure 4.13). (B)
Two alleles chosen at random, one from X and the other from Y, are identical by
descent with probability $F_{XY}$, which is the inbreeding coefficient of a
hypothetical offspring of X and Y. The ratio of $F_{XY}$ to $(1 + F_X)/2$ is the appropriate measure
of genetic relationship in the consideration of kin selection.


PROBLEM 6.11 For the illustrated pedigrees (A) and (B) of full
siblings shown in the accompanying figure, calculate the break-even value
of the benefit $b$ to the recipient of altruism Y, relative to a cost value $c = 1$
to the altruist X, in order to ensure an increase in frequency of an
additive gene for altruism. Why are the answers different in the two cases?


ANSWER In case (A), a hypothetical offspring of X and Y has an
inbreeding coefficient of $F_{XY} = (1/2)^3 + (1/2)^3 = 1/4$, and $F_X = 0$. Therefore,
$r = 2 \times 1/4 = 1/2$, and the break-even value of $c/b = 1/2$. Hence, for $c = 1$,
the break-even value of $b = 2$. (This calculation is the theoretical basis
of Haldane's quip about laying down his life for two brothers.) In
pedigree (B), $F_{XY} = 4 \times (1/2)^5 + 2 \times (1/2)^3 = 3/8$ and $F_X = 2 \times (1/2)^3 = 1/4$.
Therefore, $r = 2(3/8)/(1 + 1/4) = 3/5$. For a cost of $c = 1$, the break-even
value of $b$ equals 5/3. The values differ in the two cases because of the
differing inbreeding. In case (B), even though X is inbred, the
break-even value of $b$ is smaller because of the closer genetic relationship
between X and Y.
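
The arithmetic in Problem 6.11 can be reproduced with exact fractions. The Python sketch below takes the inbreeding coefficients as given inputs (it does not trace pedigrees) and applies Equation 6.25 together with the break-even condition $c/b = r$. The function names are illustrative only.

```python
from fractions import Fraction


def relatedness(f_xy, f_x):
    """Genetic relationship r = 2 F_XY / (1 + F_X), as in Equation 6.25."""
    return 2 * Fraction(f_xy) / (1 + Fraction(f_x))


def break_even_benefit(cost, f_xy, f_x):
    """Benefit b at which the cost-benefit ratio c/b equals r."""
    return Fraction(cost) / relatedness(f_xy, f_x)


def altruism_spreads(cost, benefit, f_xy, f_x):
    """True if c/b is smaller than r, the condition for the altruism allele to increase."""
    return Fraction(cost) / Fraction(benefit) < relatedness(f_xy, f_x)


if __name__ == "__main__":
    # Pedigree (A): full siblings with noninbred parents.
    print(relatedness(Fraction(1, 4), 0), break_even_benefit(1, Fraction(1, 4), 0))
    # Pedigree (B): values of F_XY and F_X taken from the answer to Problem 6.11.
    print(relatedness(Fraction(3, 8), Fraction(1, 4)),
          break_even_benefit(1, Fraction(3, 8), Fraction(1, 4)))
```

With $F_X = 0$, the same functions give break-even benefits of 4 for $F_{XY} = 1/8$ (uncle and nephew) and 8 for $F_{XY} = 1/16$ (first cousins), matching the rest of Haldane's quip.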


INTERDEME  SELECTION  AND  THE  SHIFTING  BALANCE  THEORY 

Another alternative type of selection arises in the context of interdeme
selection, which takes place between semi-isolated subpopulations (demes) of the
same species. If subpopulations composed of certain genotypes are more
likely to become extinct and have their vacated habitats recolonized by
migrants from other subpopulations composed of other genotypes, then the
more successful subpopulations can, in some sense, be considered as having
a greater "fitness" than the less successful ones. Since this concept of
population fitness is a characteristic of the entire population and not merely the
average fitness of the genotypes within it ($\bar{w}$), interdeme selection is outside
the realm of most conventional models of selection. Interdeme selection is
one type of group selection (Wilson 1983).

Interdeme selection plays an essential role in the shifting balance theory
of evolution of Wright (1977 and earlier). In the shifting balance theory, a
large population that is subdivided into a set of small, semi-isolated
subpopulations (demes) has the best chance for the subpopulations to explore
the full range of the adaptive topography and to find the highest fitness peak
on a convoluted adaptive surface. If the subpopulations are sufficiently
small, and the migration rate between them is sufficiently small, then the
subpopulations are susceptible to random genetic drift of allele frequencies,
which allows them to explore their adaptive topography more or less
independently. In any subpopulation, random genetic drift can result in a
temporary reduction in fitness that would be prevented by selection in a larger
population, and so a subpopulation can pass through a "valley" of reduced
fitness and possibly end up "climbing" a peak of fitness higher than the
original. Any lucky subpopulation that reaches a higher adaptive peak on
the fitness surface increases in size and sends out more migrants to nearby
subpopulations, and the favorable gene combinations are gradually spread
throughout the entire set of subpopulations by means of interdeme
selection.

The  shifting  balance  process  includes  three  distinct  phases: 

1. An exploratory phase, in which random genetic drift plays an important
role in allowing small subpopulations to explore their adaptive topography.




2.  A phase  of  mass  selection,  in  which  favorable  gene  combinations  created 
by  chance  in  the  random  drift  phase  become  rapidly  incorporated  into 
the  genome  of  local  subpopulations  by  the  action  of  natural  selection. 

3.  A phase  of  interdeme  selection,  in  which  the  more  successful  demes 
increase  in  size  and  rate  of  migration;  the  excess  migration  shifts  the 
allele  frequencies  of  nearby  subpopulations  until  they  also  come  under 
the  control  of  the  higher  fitness  peak.  The  favorable  genotypes  thereby 
become  spread  throughout  the  entire  population  in  an  ever-widening 
distribution.  Where  the  region  of  spread  from  two  such  centers  overlaps, 
a new  and  still  more  favorable  genotype  may  be  formed  and  itself 
become  a center  for  interdeme  selection.  In  this  manner,  the  whole  of  the 
adaptive  topography  can  be  explored,  and  there  is  a continual  shifting  of 
control  from  one  adaptive  peak  to  control  by  a superior  one. 

The  shifting  balance  theory  has  played  an  important  role  in  evolutionary 
thinking,  in  part  because  of  its  use  of  mountain-climbing  terms  as  tropes  for 
stages  in  the  evolutionary  progress:  "exploration"  of  the  adaptive  topogra- 
phy, chance  "discovery"  of  a route  to  a higher  adaptive  peak,  and  ultimately 
the  "conquest"  of  the  highest  adaptive  peak  by  the  whole  species.  However, 
as  a comprehensive  theory  of  evolution,  many  aspects  of  the  theory  remain 
untested.  For  the  theory  to  work  as  envisaged,  the  interactions  between  alle- 
les must  often  result  in  complex  adaptive  topographies  with  many  peaks  and 
valleys.  The  population  must  be  split  up  into  smaller  subpopulations,  which 
must  be  small  enough  for  random  genetic  drift  to  be  important,  but  large 
enough  for  mass  selection  to  fix  favorable  combinations  of  alleles.  Although 
migration  between  demes  is  essential,  neighboring  demes  must  be  sufficient- 
ly isolated  for  genetic  differentiation  to  take  place,  but  sufficiently  connected 
for favorable gene combinations to spread. Because of uncertainty about the
applicability  of  these  assumptions,  the  shifting  balance  process  remains  a pic- 
turesque metaphor  that  is  still  largely  untested.  However,  computer  simula- 
tions have  been  carried  out  to  investigate  the  range  of  magnitudes  of  the  key 
parameters  that  are  necessary  for  the  shifting  balance  process  to  be  effective; 
these  parameters  include  the  size  of  the  subpopulations,  the  rate  of  migration 
and  range  of  dispersal  of  the  migrants,  the  degree  of  epistasis  between  genes, 
and  the  rate  of  recombination  (Bergman  et  al.  1995).  Some  empirical  studies 
have  also  explored  the  partitioning  of  genetic  variance  within  and  between 
groups  for  traits  associated  with  fitness  (Wade  and  Goodnight  1991). 

One important implication of interdeme selection is that alleles that are
harmful in themselves may nevertheless be favored because they are beneficial
to the group. This principle is illustrated in the model in Table 6.4, where
the allele A' is harmful to organisms within demes but favorable to the deme
as a whole.

TABLE 6.4  MODEL OF INTERDEME SELECTION

Genotype                                AA          AA'            A'A'
Frequency in deme i                     $p_i^2$     $2p_iq_i$      $q_i^2$
Within-population fitness               1           1 - c          1 - 2c
Between-population fitness of deme i    $1 + 2(b - c)q_i$

Equation 6.11 implies that, within the ith deme, $\Delta q_i = -cq_i(1 - q_i)$
(assuming that $\bar{w} = 1$). Averaging across all of the subpopulations, the
change in allele frequency resulting from selection within subpopulations,
$\Delta q_w$, equals $-cq(1 - q)(1 - F)$, where F is the fixation index $F_{ST}$
discussed in Chapter 4. At the same time that within-subpopulation selection
takes place, interdeme selection favors demes containing A', and the change in
allele frequency resulting from between-subpopulation selection, $\Delta q_b$,
equals $2(b - c)q(1 - q)F$, as shown by Crow and Aoki (1982). Putting the
within-subpopulation and between-subpopulation selection together, the total
change in the frequency of A' is

$$\Delta q = \Delta q_w + \Delta q_b = -cq(1 - q)(1 - F) + 2(b - c)q(1 - q)F \tag{6.26}$$


The terms on the right-hand side can be interpreted by considering the
extremes of F = 0 and F = 1. When F = 0, there is no population substructure,
which means that all subpopulations have the same allele frequency q; in this
case, the change in allele frequency is just $-cq(1 - q)$. At the other extreme,
when F = 1, each subpopulation is fixed for either A or A', and the proportion
fixed for A' equals q. The between-subpopulation selection is therefore
analogous to selection between alleles in a haploid organism in which the
fitnesses of A and A' demes are in the ratio 1 : 1 + 2(b - c). In this case,
therefore, the change in allele frequency is $2(b - c)q(1 - q)$ (from Equation
6.6, assuming that $\bar{w} = 1$).
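
The two extremes can be checked numerically. The following is a minimal Python sketch of Equation 6.26, not taken from the text: the function name `delta_q` and the values chosen for q, b, and c are illustrative assumptions, and the mean fitness is taken as 1, as in the derivation above.

```python
def delta_q(q, b, c, F):
    """Total change in the frequency of A' per Equation 6.26 (mean fitness ~ 1)."""
    within = -c * q * (1 - q) * (1 - F)        # selection within demes
    between = 2 * (b - c) * q * (1 - q) * F    # selection between demes
    return within + between


# Hypothetical values, chosen only for illustration.
q, b, c = 0.2, 0.3, 0.1

# F = 0: no substructure, so only the within-deme term -c q (1 - q) remains.
print(delta_q(q, b, c, F=0.0))

# F = 1: every deme is fixed, so only the between-deme term 2 (b - c) q (1 - q) remains.
print(delta_q(q, b, c, F=1.0))
```

For intermediate F the sign of the result depends on the balance between the two terms, which is the subject of the inequality that follows.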

Equation 6.26 implies that $\Delta q > 0$ if

$$\frac{b - c}{c} > \frac{1 - F}{2F} \tag{6.27}$$


This  is  the  condition  necessary  for  selection  between  demes  to  override 
selection  within  demes,  and  the  formulation  is  quite  general  (Crow  and  Aoki 
1982).  A biological  interpretation  of  the  inequality  in  Equation  6.27  can  be 
inferred  by  comparison  with  the  break-even  point  for  kin  selection  given  in 
Equations 6.24 and 6.25. Expressing 6.27 in terms of $r = 2F/(1 + F)$, which
means that $F = r/(2 - r)$, yields $c/b < r$; this condition is identical to Equation
6.24.  In  these  models,  the  equivalence  between  kin  selection  and  interdeme 
selection  results  from  the  shared  remote  ancestry  of  the  members  of  each 
subpopulation  caused  by  random  genetic  drift  among  the  subpopulations. 
The  members  of  each  subpopulation  are  related  by  kinship,  and  so  interdeme 
selection  is  the  same  phenomenon  as  kin  selection;  the  break-even  point  is 
that  at  which  the  benefit  b to  one's  kin  through  interdeme  selection  equals  the 
cost  to  one's  self  c through  direct  selection  against  the  A'  allele. 
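
The substitution that turns Equation 6.27 into the kin-selection condition can be written out step by step. This is a worked sketch of the algebra only, assuming b, c, and r are all positive so that the inequalities keep their direction.

```latex
\begin{align*}
\frac{b-c}{c} &> \frac{1-F}{2F}
  && \text{(Equation 6.27)} \\
\frac{1-F}{2F} &= \frac{(2-2r)/(2-r)}{2r/(2-r)} = \frac{1-r}{r}
  && \text{(substituting } F = r/(2-r)\text{)} \\
\frac{b}{c} - 1 &> \frac{1}{r} - 1
  \;\Longrightarrow\; \frac{c}{b} < r
  && \text{(taking } b, c, r > 0\text{)}
\end{align*}
```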




If  there  are  a large  number  of  subpopulations,  each  of  size  N,  that 
exchange  migrants  in  such  a way  that  m is  the  proportion  of  genes  in  each 
deme  that  are  exchanged  each  generation  for  genes  chosen  at  random  from 
the  other  demes,  then  the  approximate  value  of  F at  equilibrium  is  given  by 
Equation 5.17 as $F = 1/(1 + 4Nm)$. Consequently, the right-hand side of Equation
6.27 becomes $2Nm$. In other words, $(1 - F)/2F$ equals the number of
migrant  diploid  organisms  per  generation.  We  therefore  conclude  from  Equa- 
tion 6.27  that  selection  between  demes  overrides  selection  within  demes  only 
when  the  benefit  to  the  group  (b  - c),  relative  to  the  cost  to  the  individual 
organism  (c),  is  greater  than  the  average  number  of  migrant  organisms  per 
generation.  This  principle  defines  a rather  stringent  limit  above  which  migra- 
tion among  demes  cancels  any  possible  effects  of  interdeme  selection. 
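
The reduction of the right-hand side of Equation 6.27 to 2Nm can be verified numerically. The sketch below is illustrative only: the function names and the values of N, m, b, and c are assumptions, and it relies on the equilibrium approximation F = 1/(1 + 4Nm) quoted above.

```python
def equilibrium_F(N, m):
    """Approximate equilibrium fixation index, F = 1 / (1 + 4Nm) (Equation 5.17)."""
    return 1.0 / (1.0 + 4.0 * N * m)


def threshold(F):
    """Right-hand side of Equation 6.27, (1 - F) / (2F)."""
    return (1.0 - F) / (2.0 * F)


def interdeme_selection_wins(b, c, N, m):
    """True when (b - c)/c exceeds the threshold set by migration."""
    return (b - c) / c > threshold(equilibrium_F(N, m))


# Hypothetical deme size and migration rates.
N = 50
for m in (0.01, 0.05):
    F = equilibrium_F(N, m)
    # The threshold should agree with 2Nm up to rounding error.
    print(m, threshold(F), 2 * N * m, interdeme_selection_wins(0.3, 0.1, N, m))
```

With these illustrative numbers the same benefit and cost pass the test at the lower migration rate and fail it at the higher one.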

SUMMARY 

Natural  selection  can  take  place  in  many  different  ways.  The  simplest  case  is 
that  in  a haploid  organism  in  which  the  relative  fitnesses  of  the  alternative 
genotypes  are  constant.  Models  of  discrete  generations  and  of  continuous 
exponential  growth  are  presented.  In  the  discrete  model,  the  relative  fitness- 
es are  called  Darwinian  fitnesses;  in  the  continuous  model,  they  are  called 
Malthusian fitnesses. The relationship is that $\ln w = m$, where w and m are the
Darwinian  and  Malthusian  fitnesses,  respectively. 

In  a diploid  organism,  continuous  population  growth  is  difficult  to  model 
when  the  genotypes  differ  in  their  rates  of  reproduction.  The  "standard" 
diploid  model  is  that  of  discrete  generations  in  which  the  genotypes  may  dif- 
fer in  the  probability  of  survival  from  fertilization  to  adulthood  (survivorship 
or  viability  selection)  but  are  equal  in  fertility.  In  such  a model  with  two  alle- 
les, A and  a,  and  constant  fitnesses  of  the  diploid  genotypes,  four  outcomes  of 
selection  are  possible:  A becomes  fixed;  a becomes  fixed;  there  is  a globally 
stable  equilibrium;  or  there  is  an  unstable  equilibrium.  Fixation  of  A or  a 
results  from  directional  selection  in  which  either  AA  or  aa  is  favored  and  the 
fitness  of  the  heterozygous  genotype  is  intermediate  between  the  homozy- 
gous genotypes  (or  possibly  equal  to  one  of  them).  The  stable  equilibrium 
results  from  heterozygote  superiority  (overdominance),  in  which  the  fitness 
of  the  heterozygous  genotype  exceeds  that  of  both  homozygous  genotypes. 
At the stable equilibrium of allele frequency, the average fitness in the
population, $\bar{w}$, is maximized. An unstable equilibrium arises when the fitness of the
heterozygous  genotype  is  smaller  than  that  of  both  homozygous  genotypes. 
The  outcome  of  selection  then  depends  on  the  initial  conditions;  fixation  of 
either  A or  a takes  place  according  to  whether  the  initial  frequency  of  A is 
greater  than  or  less  than  the  unstable  equilibrium  frequency. 

Mutation-selection balance refers to the maintenance of a harmful allele in
a population at a low equilibrium frequency because, in every generation, the
elimination of preexisting harmful alleles by selection is offset by the
introduction of new harmful alleles by mutation. For a completely recessive
allele in which the relative fitness of the homozygous recessive genotype is
1 - s, the equilibrium frequency of the harmful allele is given by
$\hat{q} = \sqrt{\mu/s}$, where $\mu$ is the rate of mutation per generation of
the wildtype allele to the harmful allele. For a partially dominant allele, the
relative fitnesses of the heterozygous and homozygous genotypes carrying the
harmful allele are 1 - hs and 1 - s, where h is the degree of dominance. In
this case, the equilibrium allele frequency is given approximately by
$\hat{q} = \mu/(hs)$. An important implication of these formulas is that a small
degree of dominance of a "recessive" allele has a disproportionate effect in
decreasing the allele frequency at equilibrium. Another important implication
is the Haldane-Muller principle,
which  states  that,  at  mutation-selection  balance,  the  total  genetic  load  (mea- 
sured as  the  product  of  the  genotype  frequency  times  the  decrease  in  fitness 
of  the  genotype)  is  independent  of  the  fitnesses  and  depends  only  on  the  rate 
of  recurrent  mutation. 
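
The disproportionate effect of slight dominance can be seen by evaluating the two equilibrium formulas side by side. This is a small sketch under assumed values: the mutation rate, selection coefficient, and degree of dominance below are hypothetical, and the function names are introduced only for illustration.

```python
import math


def q_hat_recessive(mu, s):
    """Equilibrium frequency of a completely recessive harmful allele, sqrt(mu/s)."""
    return math.sqrt(mu / s)


def q_hat_partial_dominance(mu, h, s):
    """Approximate equilibrium frequency when heterozygotes have fitness 1 - hs."""
    return mu / (h * s)


mu, s = 1e-6, 1.0   # hypothetical mutation rate; homozygote lethal
h = 0.02            # hypothetical, very slight dominance

print(q_hat_recessive(mu, s))             # fully recessive case
print(q_hat_partial_dominance(mu, h, s))  # partially dominant case
```

With these illustrative values, a 2 percent fitness reduction in heterozygotes lowers the equilibrium frequency about twentyfold relative to the fully recessive case.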

In  nature,  selection  must  often  be  expected  to  have  a more  complex  mech- 
anism than  that  of  differential  survivorship  envisaged  in  the  standard  model. 
Among  the  more  complex  types  of  selection  are  frequency-dependent  selec- 
tion, density-dependent  selection,  fecundity  selection,  selection  in  age-struc- 
tured populations,  selection  when  there  are  heterogeneous  environments, 
diversifying  selection  favoring  rare  alleles  or  genotypes,  differential  selection 
in  the  sexes,  selection  for  X-linked  genes,  gametic  selection,  meiotic  drive 
(non-Mendelian  segregation),  multiple-alleles  selection,  multiple-loci  selec- 
tion, and  sexual  selection.  Multiple  loci  are  a particularly  important  source  of 
complexity  even  when  the  fitness  differences  are  entirely  due  to  survivor- 
ship. In  particular,  with  strong  epistasis  and  tight  linkage,  there  may  be  mul- 
tiple stable  interior  equilibria  and  the  equilibria  may  not  coincide  with  points 
of  maximum  average  fitness.  With  weak  epistasis  and  loose  linkage,  howev- 
er, the  average  fitness  in  the  population  usually  does  tend  to  increase. 

Extended  concepts  of  fitness  can  include  the  effects  of  selection  acting  on 
groups  of  relatives  or  on  subpopulations.  Kin  selection  invokes  the  concept  of 
inclusive  fitness,  which  embraces  not  only  an  organism's  own  fitness  but  also 
the  fitness  of  its  relatives  (exclusive  of  direct  descendants).  Kin  selection  has 
been  invoked  to  explain  the  evolution  of  many  behavioral  traits  that  appear 
to  be  detrimental  to  the  individual  organism  but  beneficial  to  its  relatives. 
The  most  dramatic  examples  are  found  in  social  insects,  in  which  certain 
organisms  are  reproductively  sterile  and  devote  their  lives  to  the  care  and 
feeding  of  the  queen  and  the  protection  of  the  colony.  Generally  speaking, 
alleles  for  altruistic  behavior  can  increase  in  frequency  if  the  loss  in  fitness  of 
the  altruist  is  offset  by  the  increase  in  inclusive  fitness  to  the  beneficiaries  of 
the altruism. More precisely, for additive alleles, the condition for increase in
frequency of an allele predisposing to altruism is $c/b < r$, where c and b are
the fitness cost to the altruist X and benefit to the relative Y, respectively, and
$r = 2F_{XY}/(1 + F_X)$.

Interdeme  selection  plays  an  important  role  in  the  shifting  balance  theo- 
ry of  evolution.  According  to  this  theory,  adaptive  topographies  are  highly 
complex  surfaces  with  many  peaks  and  valleys.  In  small,  partially  isolated 
subpopulations,  random  genetic  drift  promotes  the  random  exploration  of 
the  topography.  When,  by  chance,  a subpopulation  comes  under  the  control 
of  a higher  fitness  peak,  mass  selection  takes  precedence  and  rapidly  multi- 
plies the  favored  gene  combinations.  Excess  migration  from  the  successful 
subpopulation  shifts  the  allele  frequencies  in  surrounding  subpopulations 
and,  through  repetition  of  the  selection  process,  the  favored  gene  combina- 
tions progressively  spread  in  waves  throughout  the  entire  population.  Influ- 
ential as  metaphor,  the  shifting  balance  theory  has  not  yet  been  adequately 
evaluated  as  an  accurate  description  of  the  principal  mechanism  of  evolu- 
tionary change. 

PROBLEMS 

1.  Suppose that in the ith generation of a haploid population the fitnesses of
A and a are $1 : s_i$. Show that $p_n/q_n = p_0/(q_0 s_0 s_1 s_2 \cdots s_{n-1})$.
If this is written as $p_n/q_n = p_0/(q_0 s^n)$, then how can s be interpreted?

2.  If the fitnesses of AA, Aa, and aa are 1.0, 0.9, 0.6, and $p_0 = 0.7$, calculate
$p_1$, $p_2$, and $p_3$, the allele frequencies after 1, 2, and 3 generations of selection.

3.  Calculate  the  equilibrium  allele  frequency  with  overdominance  when  the 
fitnesses  of  AA,  Aa,  and  aa  are,  respectively: 

a.  0.300,1,0.700. 

b.  0.930,1,0.970. 

c.  0.993,1,0.997. 

4.  Calculate $\bar{w}$ for $w_{11} = 0.9$, $w_{12} = 1$, $w_{22} = 0.6$, and p = 0.8,
assuming random mating. Does any other p give a larger $\bar{w}$? Why or why not?

5.  If a rare allele that is lethal when homozygous decreases in frequency by
1% each generation (i.e., if $q' = 0.99q$), then what is the selection coefficient
against heterozygotes? (Hint: Assume that $qh$ is small compared to 1.)

6.  If  selection  is  not  too  intense,  an  additive  gene  giving  fitnesses  1 + s, 
1 + s/2,  and  1 in  AA,  Aa,  and  aa  will  increase  in  frequency  approximately 
according to $\ln(p_t/q_t) = \ln(p_0/q_0) + (s/2)t$. Calculate the approximate
number  of  generations  required  to  evolve  significant  insecticide  resis- 
tance in  an  insect  population  when  s = V2  and  p0  = 10~\  Significant  resis- 
tance in  the  population  may  be  taken  as  p,  = lO-1.  Show  that,  when  pt/po 
« 1,  t = (2/s)ln(p,/p0). 

7.  Show  that  a random  mating  diploid  population  with  fitnesses  1, 1 - s,  and 
$(1 - s)^2$ for AA, Aa, and aa gives the same change in the allele frequency p
of  A as  a haploid  population  with  fitnesses  1 and  1 - s of  A and  a. 




8.  If  selection  is  not  too  strong,  the  time  required  for  the  allele  frequency  of 
a favored  dominant  allele  to  change  from  p0  to  pt  is  given  by 

$\ln(p_t/q_t) + (1/q_t) = [\ln(p_0/q_0) + (1/q_0)] + st$

Use  this  equation  to  derive  the  analogous  equation  for  a favored 
recessive. 

9.  The following equation has equilibria at p = 0, 1/2, and 1. Classify the
equilibria as to stability. If there is a stable equilibrium, is it locally or
globally stable?

$\Delta p = p(1/2 - p)(1 - p)$

10. Show that the allele frequency of a recessive lethal in generation n is
given by $q_n = q_0/(1 + nq_0)$. (Hint: It is easiest to derive an expression first
for $1/q_n$.) How many generations are required to reduce the allele
frequency by half?

11. The mutation rate to a dominant gene for neurofibromatosis is
approximately $9 \times 10^{-5}$ and the reproductive fitness of affected individuals
is estimated as 1/2. What is the expected equilibrium frequency of affected
individuals at birth?

12.  What  is  the  equilibrium  frequency  of  a recessive  gene  arising  with  a 
mutation  rate  of  4 x 10  h and  a reproductive  fitness  in  homozygotes  of 
0.8?  What  would  it  be  if  the  gene  were  partially  dominant  with  h = 0.05? 

13.  What  is  the  equilibrium  frequency  of  a recessive  gene  arising  with  a 
mutation rate of $10^{-6}$ with a fitness of 0.4 in homozygotes? How much
would  this  be  reduced  if  the  homozygotes  did  not  reproduce  at  all? 

14. For a rare allele maintained at an equilibrium frequency of $q = \mu/h$,
where  h is  the  selection  coefficient  against  heterozygotes,  show  that  the 
proportion  of  heterozygous  zygotes  resulting  from  new  mutations  is 
approximately  equal  to  h. 

15. A polymorphism is said to be protected if all of the fixation states are
unstable equilibria. Suppose the viabilities of males and females are as
follows:

             AA      Aa          aa
Females      0.9     1           0.8
Males        1.0     $v_{12}$    0.5

What is the smallest value of $v_{12}$ that ensures a protected polymorphism?
(Hint: Some algebra shows that a condition for polymorphism is
$w_{12}/w_{11} + v_{12}/v_{11} > 2$ and $w_{12}/w_{22} + v_{12}/v_{22} > 2$.)

16. If allele a is a recessive lethal in zygotes and the relative fitness of A : a
gametes is 1 - s : 1, then what is the equilibrium allele frequency of a?
(Hint: The recursion simplifies greatly for the case of a recessive lethal,
and equilibrium is given by $p = v_1 w_{12}/[(v_1 + v_2)w_{12} - v_1 w_{11}]$.)

17.  In  a Drosophila  population  cage  containing  a meiotically  driven  chromo- 
some known  as  segregation  distorter,  the  equilibrium  frequency  of  the 
driven  chromosome  was  approximately  0.125  and  the  segregation  ratio  in 
heterozygotes  was  about  k = 0.75  (Hiraizumi,  Sandler,  and  Crow  1960). 
The  meiotic  drive  chromosome  is  homozygous  lethal  in  both  sexes. 
The  equilibrium  between  viability  and  meiotic  drive  in  this  case  is  p = 
2 (k  - \)wX2/(\  - 2wu).  Use  this  equation  to  estimate  the  approximate  value 
of $w_{12}$ consistent with these data.

18.  In  a multiple  allele  system  in  which  each  heterozygote  is  superior  to  the 
homozygotes  for  the  alleles  it  contains,  why  are  all  alleles  not  maintained 
by  selection? 

19.  The  viabilities  of  genotypes  A' A',  A'A,  and  AA  are  0.5, 1,  and  0.7,  respec- 
tively. If  the  initial  frequency  of  allele  A'  is  .05,  what  will  the  frequency  be 
when  the  population  comes  to  equilibrium?  If  a mutation  occurs,  intro- 
ducing a novel  allele  A",  such  that  the  fitnesses  of  A"A",  A" A',  and  A"A 
are  all  0.8,  determine  whether  this  allele  will  increase  in  frequency. 

20. Suppose alleles $A_1$, $A_2$, $A_3$, and $A_4$ are additive in their effects, and the
homozygote fitnesses are $A_1A_1$ : 0.8, $A_2A_2$ : 0.6, $A_3A_3$ : 0.4, and $A_4A_4$ : 0.2.
What  are  the  heterozygote  fitnesses?  If  all  alleles  are  equally  frequent, 
what is the mean fitness for this locus?

For each generation there is an element of chance in the drawing
of  gametes  that  will  unite  to  form  the  next  generation.  Chance 
alone  can  result  in  changes  in  allele  frequency,  and  because  the 
allele  frequencies  do  not  change  in  any  predetermined  way  by  this  sampling 
process,  the  process  is  known  as  random  genetic  drift.  In  Chapter  5 we 
looked  at  some  of  the  basic  principles  of  how  random  genetic  drift  affects  lev- 
els of  variation  in  populations,  but  the  subtlety  and  importance  of  drift  are 
such  that  we  will  now  devote  this  chapter  to  the  subject. 


RANDOM  GENETIC  DRIFT  AND  BINOMIAL  SAMPLING 

Consider a large population at Hardy-Weinberg equilibrium with alleles A
and a at equal frequencies p = q = 1/2. In this population, the genotype
frequencies are 1/4 AA, 1/2 Aa, and 1/4 aa. Suppose four individuals are drawn at
random from this population to start a colony. It is possible, by chance alone,
that the sample will consist of 4 AA individuals. (This chance is
$(1/4)^4 = 1/256$.) Similarly, it is possible that all four will be aa. Any other
possible sample could have been drawn, and it is not difficult to work out the
probability for each type of sample. If the colony remains at just four
individuals, this same kind of random sampling occurs each generation. At each
generation, there is an opportunity for a large change in gene frequency
caused purely by this process of sampling. One consequence of drift soon
becomes clear: eventually the population will have either all A alleles or all
a alleles. Once the population reaches such a "fixation" state, it is stuck. Only
new mutations or migrants into the population can reintroduce variation.

Figure 7.1  The gene frequencies and sampling that occur in the Wright-Fisher
model. Initially there are N diploid adults with a gene whose frequency
is $p_0$. The adults make an infinite number of gametes having the same allele
frequency. From this pool, 2N gametes are drawn at random to constitute the N
diploid individuals for the next generation.

In the example above we sampled four diploid individuals each generation.
For our purposes, this is equivalent to drawing eight gametes at random
from a pool of gametes. For example, if eight gametes are drawn from a
population with p = 1/2, there are nine possible outcomes, having 0, 1, 2, 3, ..., 8
copies of the A allele and the remaining copies being the a allele. The
probability of each of the nine possibilities is given by the binomial distribution,
first introduced in Chapter 1. For the case of fixation, we need to find the
probability of drawing eight copies of the A allele. Each draw is considered
independent of the other draws, and each has a chance of 1/2 of yielding an A.
This means that the probability of drawing eight consecutive A alleles is
$(1/2)^8 = 1/256$. It is no coincidence that this is the same as the probability of
drawing four AA genotypes as described above.
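
The nine probabilities mentioned above can be tabulated directly. This is a short illustrative sketch using Python's standard library; the helper name `binomial_pmf` is an assumption introduced here, not notation from the text.

```python
from math import comb


def binomial_pmf(i, n, p):
    """Probability of exactly i copies of A among n independent draws."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


# Eight gametes drawn from a pool with p = 1/2.
probs = [binomial_pmf(i, 8, 0.5) for i in range(9)]
for i, prob in enumerate(probs):
    print(i, prob)

# The last entry (eight copies of A) is the fixation probability (1/2)^8.
print(probs[8] == 1 / 256)
```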

In  sampling  gametes  from  a finite  population,  the  sampling  process  is 
depicted  in  Figure  7.1.  In  each  generation  there  are  N diploid  individuals  in 
the  population.  Regardless  of  the  way  fertilization  occurs,  we  can  imagine 
the  sampling  process  to  be  one  of  sampling  with  replacement,  such  that  the 
diploid  individuals  contribute  to  an  essentially  infinite  gamete  pool  whose 
allele  frequency  is  the  same  as  the  allele  frequency  in  the  adults.  From  this 
infinite  gamete  pool,  2 N gametes  are  drawn  and  unite  at  random  to  form  the 
next  generation.  Under  this  kind  of  sampling  process,  the  distribution  of  fre- 
quencies of  gametes  is  expected  to  be  binomial. 


PROBLEM 7.1  Suppose there are a thousand round pea seeds and a
thousand wrinkled pea seeds in a soup pot. Enumerate all possible
samples of four seeds drawn from the pot, and calculate the
probability of each.

ANSWER  The chance of drawing a round seed is 1/2 (as is the chance
of drawing a wrinkled seed). The chance of drawing four round seeds
is roughly $(1/2)^4 = 1/16$, since the fraction of round seeds remains fairly
close to 1/2 even after a few are drawn. The chance of drawing all four
wrinkled seeds is also 1/16. There are four ways to get three round and
one wrinkled seed: RRRW, RRWR, RWRR, and WRRR, and each of
these has chance 1/16. Similarly, there are four ways to get three
wrinkled and one round seed: WWWR, WWRW, WRWW, and RWWW,
and again, each of these four possibilities has probability 1/16. Finally,
we could draw two round and two wrinkled seeds, and such a sample
could be drawn in any of six possible orders: RRWW, RWRW, RWWR,
WRRW, WRWR, and WWRR. Each order has chance 1/16, so the total
chance of getting two round and two wrinkled seeds is 6/16 = 3/8. We
have exhaustively enumerated all 16 possible samples of four seeds,
and since each of the 16 possibilities has chance 1/16, the sum of the
probabilities of the events is 1. This is a check that we have considered
all possibilities. Note that the binomial distribution (Equation 7.1)
makes these calculations much easier.
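
The enumeration in the answer can be reproduced mechanically. The sketch below is illustrative and makes the same approximation as the answer: each draw is treated as independent with chance 1/2, which is only roughly true for a pot of two thousand seeds. The variable names are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered samples of four seeds, e.g. "RRWR" (R = round, W = wrinkled).
orders = ["".join(seeds) for seeds in product("RW", repeat=4)]
chance_per_order = Fraction(1, 2) ** 4   # each specific order has chance 1/16

# Group the orders by how many round seeds they contain.
by_round_count = Counter(order.count("R") for order in orders)
for n_round in range(4, -1, -1):
    n_orders = by_round_count[n_round]
    print(n_round, "round:", n_orders, "orders, probability",
          n_orders * chance_per_order)

# The probabilities over all orders should sum to 1.
print(len(orders) * chance_per_order)
```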


To  take  a specific  example,  a population  of  nine  diploid  organisms  arises 
from  a sample  of  just  18  gametes,  but  the  gametes  can  be  thought  of  as  being 
sampled  from  an  essentially  infinite  pool  of  gametes.  Because  small  samples 
are frequently not representative, an allele frequency in the sample may
differ from that in the entire pool of gametes. In fact, if the number of gametes
in a sample is represented as 2N (in this example, 2N = 18), the probability
that the sample contains exactly i alleles of type A is the binomial probability

$$\Pr\{i \text{ copies of } A\} = \binom{2N}{i} p^i q^{2N - i} \tag{7.1}$$

where $\binom{2N}{i}$ means $(2N)!/[i!(2N - i)!]$; p and q are, respectively, the allele
frequencies of A and a in the entire pool of gametes (p + q = 1); and i takes on
any integer value between 0 and 2N. The new allele frequency in the
population (call it p') is therefore i/2N because, by definition, the allele frequency of
A equals the number of A alleles (in this case i) divided by the total (in this
case 2N). In the next generation, the sampling process occurs anew, and the
new probability of a prescribed number of A alleles occurring in the 2N
gametes is given by the binomial probability above, with p now replaced by
p' and q by 1 - p'. Thus, the allele frequency may change at random from
generation to generation. Computer-generated examples based on random
numbers are shown in Figure 7.2. Each line in Figure 7.2A gives the number of A
alleles  in  20  successive  generations  of  random  genetic  drift  in  a population  of 
size  N = 9 (so  2 N = 18).  As  you  can  see,  individual  populations  behave  very 
erratically.  In  seven  populations,  the  A allele  became  fixed  (that  is,  p = 1);  in 
five  populations,  A became  lost  (that  is,  p = 0).  The  other  eight  populations 
remained  unfixed  (A  was  neither  fixed  nor  lost),  but  the  final  allele  frequen- 
cy among  the  unfixed  populations  was  as  likely  to  be  one  value  as  another. 
Figure  7.2B  shows  the  same  kind  of  simulation,  except  now  2 N = 100.  With  a 
larger  population  size,  the  rate  at  which  populations  go  to  fixation  is  evi- 
dently slower.  The  principal  conclusion  from  Figure  7.2  is  that  allele  frequen- 
cies behave  so  erratically  in  any  one  population  that  prediction  is  virtually 
impossible. 
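
A simulation of this kind is easy to sketch. The code below is a minimal illustration, not the program used to draw Figure 7.2: the function name `wright_fisher`, the seed, and the choice of 20 replicate populations are assumptions, and binomial sampling is done by making 2N independent draws from the gamete pool.

```python
import random


def wright_fisher(two_n, p0, generations, rng):
    """Return the number of A alleles in each generation of one population."""
    count = round(p0 * two_n)
    counts = [count]
    for _ in range(generations):
        p = count / two_n
        # Draw 2N gametes; each is A with probability p (binomial sampling).
        count = sum(1 for _ in range(two_n) if rng.random() < p)
        counts.append(count)
    return counts


rng = random.Random(1)   # fixed seed so that a run can be repeated
populations = [wright_fisher(18, 0.5, 20, rng) for _ in range(20)]

fixed = sum(1 for pop in populations if pop[-1] == 18)
lost = sum(1 for pop in populations if pop[-1] == 0)
print("fixed:", fixed, "lost:", lost, "unfixed:", 20 - fixed - lost)
```

Because the draws are random, the numbers of fixed, lost, and unfixed populations vary from seed to seed; rerunning with `two_n = 100` should generally show fewer populations reaching fixation within 20 generations.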

Although  changes  in  allele  frequency  due  to  random  genetic  drift  in  any 
individual  population  may  defy  prediction,  the  average  behavior  of  allele  fre- 
quencies in  a large  number  of  populations  can  be  predicted.  Consider  a large 
number  of  populations  all  starting  at  the  same  time  with  the  same  allele  fre- 
quency and  same  population  size  N.  Each  of  these  populations  is  assumed 
to  undergo  drift  independently  of  the  other  populations.  Except  for  their 
finite  size,  the  subpopulations  are  assumed  to  satisfy  all  the  assumptions  of 
the  Hardy-Weinberg  model,  with  the  additional  stipulations  that  (1)  the 
number  of  males  and  females  is  equal,  and  (2)  each  individual  has  an  equal 
chance  of  contributing  successful  gametes  to  the  next  generation.  The  key 
point  illustrated  in  Figure  7.3  is  that  we  can  describe  how  these  populations 
change  in  allele  frequency  by  considering  time  slices  through  the  graph,  and 
tallying  a histogram  of  the  counts  of  populations  having  each  specified  allele 
frequency.  Initially,  the  populations  will  all  be  close  to  the  starting  allele  fre- 
quency. As  time  passes,  the  populations  "drift"  apart,  and  eventually  they  are 
spread  over  all  possible  allele  frequencies.  Finally,  as  we  will  see,  each  popu- 
lation must  go  to  fixation  for  one  allele  or  the  other. 

Figure 7.2  Computer simulations of the Wright-Fisher model of random
genetic drift. Each line represents a population of size (A) 2N = 18 or (B) 2N =
100, simulated for 20 generations. Each generation alleles are sampled with
replacement as described in the text. An allele frequency of p = 0.5 in A implies
that there are nine copies of the A allele, and nine copies of the a allele. In B, an
allele frequency of 0.5 implies 50 copies of each allele. Note that the larger
population size in B results in smaller oscillations of allele frequency, and a slower
rate of fixation.

Figure 7.3  The model of random genetic drift can be seen by imagining a
large collection of populations undergoing the process of repeated sampling. As
the top part of the figure indicates, the populations' allele frequencies change
erratically, and tend to drift apart. At time intervals, a snapshot of the
populations would produce distributions of allele frequencies whose variance increases
over time.

The trick in understanding drift is to learn how to deduce the distributions
of allele frequencies plotted in Figure 7.3. We just described what would
happen after one generation: the set of populations would have a range of
allele frequencies as described by the binomial distribution. The binomial
distribution gives us the probability that a population has allele frequency p'
after one generation of drift. If we consider 1000 populations all starting at p,
the binomial distribution gives us the fraction of those populations with
allele frequency p'. What about the following generation? For each
population, one can imagine the whole sampling process as starting over again. The
population does not remember where it was the previous generation, and so
the binomial sampling occurs again. But this time, the allele frequency is p',
and this value must be the frequency used in Equation 7.1. Each of the 1000
populations may have a different p' after one generation of drift, so to get the
second generation, we need to calculate the binomial distribution 1000 times
and add the values up. Fortunately, R. A. Fisher and Sewall Wright figured
out an easier way to do this, which is described in the next section.

An experiment designed along the same lines as the one given in Figure
7.3 is shown in Figure 7.4. In this study, the history of 19 generations of
random genetic drift in 107 subpopulations of Drosophila melanogaster was
followed. Each population was initiated with 16 bw^75/bw (bw = brown eyes)
heterozygotes and maintained at a constant size of 16 individuals by
randomly choosing eight males and eight females to produce the next
generation. Each histogram in Figure 7.4 gives the number of populations
containing 0, 1, 2, ..., 32 bw^75 alleles. The pattern of change in allele
frequency in Figure 7.4 may at first appear to be complicated, but in reality a
simple thing is happening. The initially humped distribution of allele frequency
gradually becomes flat as populations fixed for bw^75 or bw begin to pile up at
the boundaries. The piling up occurs because, once an allele has been fixed or

Figure 7.4 Random genetic drift in 107 actual populations of Drosophila
melanogaster. Each of the initial 107 populations consisted of 16 bw^75/bw
heterozygotes (N = 16; bw = brown eyes). From among the progeny in each
generation, eight males and eight females were chosen at random to be the
parents of the next generation. The horizontal axis of each curve gives the
number of bw^75 alleles in the population, and the vertical axis gives the
corresponding number of populations. (Data from Buri 1956.)


lost, it remains fixed or lost since mutation is negligible over such a small
number of generations in small populations. After 19 generations, most of the
populations are fixed for one allele or the other, and among the unfixed
populations, the distribution of allele frequencies is essentially flat.


PROBLEM 7.2 Consider a self-pollinating plant population consisting
of a single heterozygous (Aa) individual on a small barren island.
Suppose the plant reproduces and dies, so that the generations are
discrete, and the population can only consist of a single plant. What is
the probability that the population is homozygous at this genetic
locus by the second generation?


ANSWER The chance that the first generation offspring is AA is 1/4
and the chance that it is aa is also 1/4, so the chance of fixation in one
generation is 1/2. If the first generation offspring is Aa, then the
probability of fixation in the second generation (given that the population is
not fixed in the first generation) is again 1/2. The probability of not
fixing in generation 1 and then fixing in generation 2 is 1/2 x 1/2 = 1/4. Add
to this the chance of fixing in one generation and we get 3/4 as the
probability of fixation by two generations. Note that the probability of
not going to fixation each generation is 1/2, and so the chance of not
fixing for two generations is 1/2 x 1/2 = 1/4.
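
The arithmetic in this answer can be checked with a short sketch. The function name prob_fixed_by is illustrative and does not come from the text; the only assumption is the one stated in the answer, namely that a selfed Aa plant leaves an Aa offspring with probability 1/2 each generation.

```python
from fractions import Fraction


def prob_fixed_by(generations: int) -> Fraction:
    """Probability that a single selfing Aa lineage is homozygous
    (AA or aa) by the given generation."""
    still_heterozygous = Fraction(1)
    for _ in range(generations):
        # Aa selfed gives 1/4 AA, 1/2 Aa, 1/4 aa, so the lineage
        # stays heterozygous with probability 1/2 each generation.
        still_heterozygous *= Fraction(1, 2)
    return 1 - still_heterozygous


for g in (1, 2, 3):
    print(g, prob_fixed_by(g))
```

Exact fractions are used so that the result for two generations can be compared directly with the 3/4 obtained above.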


Consider  an  infinitely  long  bowling  alley  with  minor  imperfections  that 
displace  the  ball  one  way  and  the  other.  The  gutters  represent  the  fixation 
states  of  p = 0 and  p = 1.  Once  the  ball  goes  in  the  gutter,  it  cannot  get  out 
again.  The  imperfections  keep  the  ball  from  rolling  in  a straight  line,  and 
eventually  it  rolls  into  the  gutter.  In  this  analogy,  the  size  of  the  population 
corresponds  to  the  width  of  the  bowling  alley;  a larger  population  implies  a 
wider  alley.  The  imperfections  still  deflect  the  ball  but,  in  proportion  to  the 
width of the alley, the ball's zigs and zags are of a smaller magnitude.
Consequently, the ball remains out of the gutter for a longer time, analogous to the
longer  time  to  fixation  for  a larger  population.  But  just  as  certainly,  the  ball 
will  eventually  land  in  the  gutter. 


THE  WRIGHT-FISHER  MODEL  OF  RANDOM  GENETIC  DRIFT 

Fisher (1930) and Wright (1931) both considered the consequences of the sort
of binomial sampling that occurs in small populations when the sampling
occurs repeatedly over many generations. This model, known as the
Wright-Fisher model, derives the distribution of allele frequencies among
populations undergoing random genetic drift. Although neither Fisher nor Wright
formulated the problem in terms of matrices, as used here, this approach
makes the problem much simpler and gives the same results. If a population
has 2N genes, and there are two alleles (A and a) that may be segregating,
then the state of the population can be described by the number of A alleles
in the population. The possible states are then 0, 1, 2, ..., 2N. The states 0 and
2N are special in that these are fixations, and once the population gets into
these states, it cannot leave. The states 0 and 2N are called absorbing states.
From any other allele frequency, it is possible for the population to drift to a
different allele frequency. However, to use an example from Figure 7.4, if
2N = 32, then the chance of drifting in one generation from 30 copies of gene
A to 29 copies of gene A is greater than the chance of drifting to two copies.
The probability of the population drifting from the state having i copies to j
copies of allele A is known as the transition probability. The transition
probability for the Wright-Fisher model is obtained directly from the binomial
distribution. If a population has i copies of allele A, then the allele frequency
is p = i/2N, and the frequency of allele a is q = 1 - i/2N. The probability of
going from i copies of A to j copies of A in one generation is:


$$T_{ij} = \binom{2N}{j}\, p^{j} q^{2N-j} \tag{7.2}$$


The transition probabilities can be put in a square matrix T, with elements
T_ij giving the transition probability from state i to state j for i, j = 0, 1, 2, ..., 2N.
The matrix T contains everything that is needed to predict the expected
distribution of populations like those in Figure 7.4 over a series of generations. This
type of model, expressed in terms of discrete states with fixed probabilities of
going from one state to another, is known as a Markov chain, and it has some
very elegant mathematical properties. Iterations of the Wright-Fisher model
give the expected outcome of a pure drift process (Figure 7.5). We will use the
Wright-Fisher model only to show one aspect about fixation probabilities.
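
The matrix iteration described above can be sketched in a few lines. This is only an illustration under the assumptions of the text (binomial transition probabilities from Equation 7.2, no mutation or selection); the function names transition_matrix, step, and distribution_after are hypothetical, and the example parameters (2N = 32, 16 starting copies, 19 generations) are borrowed from the experiment in Figure 7.4.

```python
from math import comb


def transition_matrix(two_n: int) -> list[list[float]]:
    """T[i][j] = probability of going from i copies of A to j copies
    in one generation (binomial sampling, Equation 7.2)."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        q = 1.0 - p
        matrix.append(
            [comb(two_n, j) * p**j * q ** (two_n - j) for j in range(two_n + 1)]
        )
    return matrix


def step(dist: list[float], matrix: list[list[float]]) -> list[float]:
    """Advance the distribution over states by one generation."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def distribution_after(two_n: int, start_copies: int, generations: int) -> list[float]:
    """Expected fraction of populations in each state after some generations,
    when every population starts with start_copies copies of A."""
    matrix = transition_matrix(two_n)
    dist = [0.0] * (two_n + 1)
    dist[start_copies] = 1.0
    for _ in range(generations):
        dist = step(dist, matrix)
    return dist


dist = distribution_after(two_n=32, start_copies=16, generations=19)
print(f"fraction fixed or lost after 19 generations: {dist[0] + dist[-1]:.3f}")
```

Because states 0 and 2N are absorbing, the first and last entries of the distribution can only grow from one generation to the next, which is the piling up at the boundaries seen in Figure 7.4.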


PROBLEM 7.3 Consider a population of four diploid individuals.
Calculate the probability that a population with four copies of allele A
(allele frequency p = 1/2) drifts in one generation to having three
copies. What is the probability that the population has four copies of
A? Five copies? Now consider a population of the same size, but
initially with two copies of A. What is its probability of drifting to one,
two, or three copies?


ANSWER Applying Equation 7.2, we get T_{4,3} = [8!/(5!3!)](1/2)^8 = 7/32
= 0.219. T_{4,4} = [8!/(4!4!)](1/2)^8 = 70/256 = 0.273. T_{4,5} = T_{4,3} = 0.219. (Note
that the binomial distribution is symmetric when p = 1/2, so there is
equal probability for samples that are symmetrically divergent from
p = 1/2.) In the case when the initial frequency is 1/4, we get
T_{2,1} = [8!/(1!7!)](1/4)(3/4)^7 = 0.267, T_{2,2} = [8!/(2!6!)](1/4)^2(3/4)^6 = 0.311, and
T_{2,3} = [8!/(3!5!)](1/4)^3(3/4)^5 = 0.208.


The above problem illustrates an important feature of the Wright-Fisher
model. The magnitude of change in allele frequency is greater when the allele
frequency is 1/2 than it is when the allele frequency is more skewed. The
changes are greater because the variance in the binomial sampling
distribution is greatest when p = 1/2. (The formula for the binomial variance is pq/2N.)
The variance drops to zero at p = 0 and p = 1. The variance formula makes it
clear that a large population will change allele frequency more slowly than a
smaller population because the sampling variance varies as the reciprocal of
population size. Furthermore, drift has no preferred direction: the expected
change in allele frequency is zero, so increases and decreases balance on
average, regardless of the current allele frequency. The process of drift does
not recognize when a population is close to fixation.
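
The dependence of the sampling variance on allele frequency can be checked numerically for the small population of Problem 7.3. The sketch below is illustrative only; the function name next_generation_moments is hypothetical, and it assumes nothing beyond the binomial transition probabilities of Equation 7.2.

```python
from math import comb


def next_generation_moments(i: int, two_n: int) -> tuple[float, float]:
    """Mean and variance of next-generation allele frequency,
    given i copies of allele A among 2N genes in the current generation."""
    p = i / two_n
    probs = [comb(two_n, j) * p**j * (1 - p) ** (two_n - j) for j in range(two_n + 1)]
    mean = sum(pr * j / two_n for j, pr in enumerate(probs))
    var = sum(pr * (j / two_n - mean) ** 2 for j, pr in enumerate(probs))
    return mean, var


for copies in (4, 2, 1):
    p = copies / 8
    mean, var = next_generation_moments(copies, 8)
    # Compare the exact binomial variance with the formula pq/2N.
    print(f"p = {p:.3f}: mean = {mean:.3f}, variance = {var:.5f}, pq/2N = {p * (1 - p) / 8:.5f}")
```

In each case the mean of the next generation equals the current frequency, and the variance matches pq/2N, shrinking as the frequency moves away from 1/2.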

Fisher  and  Wright  also  addressed  the  expected  time  to  fixation.  Since 
another  approach  yields  the  solution  to  this  problem  much  more  easily,  we 
will  consider  times  to  fixation  in  the  next  section. 


PROBLEM 7.4 Simulating random drift can be a very time-consuming
proposition. If one wants to simulate a population of 1000
individuals for 1000 generations, one has to draw 10^6 random numbers
and for each decide whether to accept or reject each genotype. Kimura
(1980b) came up with a shortcut that relates very closely to how the
diffusion approximation works (see the next section). The trick is to
use the recursion: p' = p + (2U - 1)√(3pq/2N), where U is a random
number uniformly distributed between 0 and 1. Each generation, one
picks one random number U, and a sample realization of the next
generation's allele frequency is obtained from the above recursion. Why
does this approach work? (Hint: The variance in a uniform
distribution is the square of the range divided by 12.)

ANSWER The expression 2U - 1, where U is a number between 0
and 1, gives a value from -1 to +1, or a range of 2. The range of
(2U - 1)√(3pq/2N) is therefore 2√(3pq/2N). Squaring this expression
and dividing by 12, the variance of this uniform random variable is
thus pq/2N, just what we get from a binomial sampling distribution.
Each generation the allele frequency has an equal chance of
increasing or decreasing, and the variance in the allele frequency change is
pq/2N. Even though the distribution of change in allele frequency is
uniform in the pseudosampling simulation instead of binomial (as it
is in the Wright-Fisher model), this process can reproduce most of the
results of the complete brute-force simulation at a tiny fraction of the
computer time.
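
A minimal sketch of the pseudosampling recursion follows. The function name pseudo_sample_drift is hypothetical, and one detail is an assumption not stated in the problem: when a step would carry the frequency past 0 or 1, the value is clamped to the boundary and treated as lost or fixed from then on.

```python
import random


def pseudo_sample_drift(p0: float, n_diploid: int, generations: int,
                        rng: random.Random) -> list[float]:
    """Allele-frequency trajectory from the recursion
    p' = p + (2U - 1) * sqrt(3pq / 2N), with U uniform on [0, 1)."""
    p = p0
    trajectory = [p]
    for _ in range(generations):
        if 0.0 < p < 1.0:
            u = rng.random()
            p += (2.0 * u - 1.0) * (3.0 * p * (1.0 - p) / (2.0 * n_diploid)) ** 0.5
            # Assumption: clamp overshoots so 0 and 1 act as absorbing states.
            p = min(1.0, max(0.0, p))
        trajectory.append(p)
    return trajectory


rng = random.Random(7)  # fixed seed so a run can be repeated
path = pseudo_sample_drift(p0=0.5, n_diploid=1000, generations=1000, rng=rng)
print(f"final allele frequency: {path[-1]:.3f}")
```

Only one random number is drawn per generation, compared with 2N draws per generation for direct binomial sampling, which is the source of the saving in computer time.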


THE  DIFFUSION  APPROXIMATION 

The pattern of change in allele frequency shown in Figure 7.4 is very nearly
that expected theoretically for an ideal population, and although the
full-blown theory of random genetic drift requires mathematics beyond the scope
of this book, some background might be of interest (see Kimura 1955, 1964,
1976; Wright 1969; Crow and Kimura 1970; Kimura and Ohta 1971). The
representation of random drift by a differential equation was first applied by
Fisher (1922), who noted that the equation describing the diffusion of heat
through a solid bar applies to random genetic drift. The distribution of
populations with allele frequencies ranging from 0 to 1 is called φ(x,t), where x
represents the allele frequency and t indicates time. Figure 7.5 shows a particular
realization of φ(x,t) changing through time. The theoretical problem is to
formulate an equation that describes how φ(x,t) changes under random genetic
drift, and to solve the equation.

The parallel between the physical process of diffusion and what is
actually going on in a finite population is a bit abstract, but it is not particularly
difficult. Consider an axis of allele frequency extending from 0 to 1. The number
of populations whose frequency is between x and x + dx at time t is our
probability density φ(x,t). Populations may enter this range of allele frequencies
by drifting in from another frequency, which occurs with a probability flux
J(x,t). Populations may leave this range of allele frequencies by drifting out,
which occurs with probability flux J(x + dx,t). The rate of change in φ(x,t) is
the difference in these fluxes, which we can write

Figure 7.5 Prediction of the Wright-Fisher model for the distribution φ(x,t) of
populations of size N = 16 with allele frequency x at generation t, for 20
generations after an initial frequency of 0.5. The values of φ(x,t) were generated using
the Markov transition probability matrix, whose terms are given by the
binomial distribution. The model with 2N = 32 predicts that fewer populations have
fixed by generation 19 than actually did go to fixation in the experiment in
Figure 7.4. This is because the effective population size is smaller than the observed
count (see Figure 7.12).


$$\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}J(x,t) \tag{7.3}$$


because [J(x,t) - J(x + dx,t)]/dx approaches -∂J(x,t)/∂x when dx is very small.


The  probability  flux  is 


$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right] \tag{7.4}$$


where M(x) is the average change in allele frequency in a population whose
current allele frequency is x, and V(x) is the variance in change in allele frequency.
M(x) is zero unless there is some force, like mutation or selection, driving the
allele frequency to change in a particular direction. (Remember that with pure
drift, increases and decreases in allele frequency balance on average.) V(x) tells how
fast allele frequencies change for a population with frequency x. Under the
Wright-Fisher model, V(x) = x(1 - x)/2N, which means that the binomial
sampling variance describes the magnitude of allele frequency change. But the
probability flux depends on the difference in rates of change from x to x + dx, and so,
just as in classical physical models of diffusion, the flux depends on the
gradient in whatever is diffusing. In the case of chemical diffusion, this gradient
would be the gradient in concentrations, and a greater difference in
concentrations would yield higher flux. In the case of our population genetic model, the
gradient is the change in sampling variance as x is varied, or dV(x)/dx.

Substituting  Equation  7.4  into  Equation  7.3,  we  get 


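$$\frac{\partial}{\partial t}\phi(x,t) = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}\left[V(x)\phi(x,t)\right] \tag{7.5}$$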


Equation  7.5  is  known  as  the  diffusion  equation,  the  forward  Kolmogorov 
equation,  or,  in  the  context  of  the  physics  of  heat  diffusion,  the  Fokker-Planck 
equation.  For  the  Wright-Fisher  model,  M(x)  = 0 and  V(x)  = x(l  - x)/2N  so 
we  get 


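$$\frac{\partial}{\partial t}\phi(x,t) = \frac{1}{4N}\frac{\partial^{2}}{\partial x^{2}}\left[x(1-x)\phi(x,t)\right] \tag{7.6}$$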


Many aspects of this problem were explored by Wright (1931), and the
formal solution to this equation, found by Kimura (1955), required some
heavy mathematics. For our purposes, some graphs will illustrate the
important properties of the diffusion equation. The two families of curves in Figure
7.6 are the theoretical distributions φ(x,t) of allele frequency among unfixed
populations after various times (t) measured in units of N generations. In
Figure 7.6A, all populations have an initial allele frequency of 1/2, as in the actual
populations in Figure 7.4; after about t = 2N generations, the distribution of
allele frequency is essentially flat, and by this time about half the populations
are still unfixed. The distributions in Figure 7.6 refer only to those
populations that are unfixed; as time goes on, more and more of the populations
become fixed, and the distributions progressively pile up at 0 and 1, as in the
histograms in Figure 7.4. Indeed, in Figure 7.6, the area under each curve is
equal to the proportion of unfixed populations, which becomes
progressively smaller. In particular, the rate at which the height of the distribution
decreases once it becomes flat is about 1/2N per generation.

Figure 7.6 Theoretical results of random genetic drift. (A) Initial allele
frequency = 1/2. (B) Initial allele frequency = 0.1. The curves have been scaled so
that the area under each curve is equal to the proportion of populations in which
fixation or loss has not yet occurred. The curves are therefore the distributions of
allele frequencies among segregating populations. (From Kimura 1955.)


Figure  7.6B  shows  what  happens  when  the  initial  allele  frequency  is  0.1; 
here  the  distributions  are  highly  asymmetrical,  and  the  distribution  of  allele 
frequency  does  not  become  flat  until  about  t = 4N  generations,  by  which  time 
only  about  10%  of  the  populations  remain  unfixed.  Once  a flat  distribution  of 
allele  frequency  is  reached,  the  distribution  remains  flat,  but  random  drift 
continues  until  fixation  or  loss  has  occurred  in  all  populations. 


PROBLEM  7.5  Demonstrate  from  Equation  7.6  that  the  fixation 
states,  x = 0 and  x = 1,  are  equilibria  of  the  diffusion  process. 


ANSWER A condition for equilibrium is that φ(x,t) remains stationary,
so that ∂φ(x,t)/∂t = 0. Substituting into Equation 7.6, we get

$$0 = \frac{1}{4N}\frac{\partial^{2}}{\partial x^{2}}\left[x(1-x)\phi(x,t)\right]$$

If x = 0 or x = 1, this equality is clearly satisfied, because ∂²/∂x² [0] = 0.
It requires a bit more work to show that these are the only equilibria.


To illustrate that the diffusion approximation and the Wright-Fisher
model give very similar results, Figure 7.7 shows the diffusion approximation
for the data in Figure 7.4, with 2N = 32, x_0 = 1/2, and t running from generation
1 through generation 19.




Figure 7.7 Kimura's (1955) solution to the diffusion equation for the
particular case of N = 16. This is the three-dimensional view of Figure 7.6, and
represents the diffusion approximation to the exact solution obtained by the
Wright-Fisher model in Figure 7.5.

Absorption Time and Time to Fixation

One useful application of the diffusion approximation has been to determine
expressions for the expected time for a neutral allele to go to fixation.
Assuming that the allele starts at frequency p, Kimura and Ohta (1969) showed that
the mean time (in generations) until the allele is fixed (ignoring cases where
the allele is lost) is


$$\bar{t}_1(p) = -\frac{4N}{p}\left[(1-p)\log(1-p)\right] \tag{7.7}$$

Similarly, they showed that the mean time to loss of the allele is

$$\bar{t}_0(p) = -\frac{4N}{1-p}\left[p\log(p)\right] \tag{7.8}$$

Combining Equations 7.7 and 7.8, the mean persistence time of an allele is

$$\bar{t}(p) = -4N\left[p\log(p) + (1-p)\log(1-p)\right] \tag{7.9}$$

where t̄(p) is the average time that an allele remains segregating in a
population (that is, until its frequency is either 0 or 1).

The  average  times  for  fixation,  loss,  and  persistence  of  a neutral  allele 
are  shown  graphically  in  Figure  7.8.  An  allele  is  expected  to  remain  in  a 
population for the longest time when its initial frequency is 1/2. When
p_0 = 1/2, the average time that a population remains unfixed is about 2.77N
generations.


Figure 7.8 Average persistence of a neutral allele in an ideal diploid
population of size N, plotted against initial allele frequency.
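
The persistence-time expressions are easy to evaluate numerically. The sketch below assumes the forms of Equations 7.7 through 7.9 as read here (natural logarithms, times in generations, 0 < p < 1); the function names are illustrative and not from the text.

```python
from math import log


def mean_time_to_fixation(p: float, n: float) -> float:
    """Equation 7.7: mean time to fixation, given that the allele fixes."""
    return -(4.0 * n / p) * (1.0 - p) * log(1.0 - p)


def mean_time_to_loss(p: float, n: float) -> float:
    """Equation 7.8: mean time to loss, given that the allele is lost."""
    return -(4.0 * n / (1.0 - p)) * p * log(p)


def mean_persistence_time(p: float, n: float) -> float:
    """Equation 7.9: mean time until the allele is either fixed or lost."""
    return -4.0 * n * (p * log(p) + (1.0 - p) * log(1.0 - p))


# Times in units of N generations (n = 1) for a few initial frequencies.
for p in (0.1, 0.5, 0.9):
    print(
        f"p = {p}: fixation {mean_time_to_fixation(p, 1):.2f}N, "
        f"loss {mean_time_to_loss(p, 1):.2f}N, "
        f"persistence {mean_persistence_time(p, 1):.2f}N"
    )
```

Weighting the fixation time by p and the loss time by 1 - p recovers the persistence time, which is how Equation 7.9 follows from Equations 7.7 and 7.8; at p = 1/2 the persistence time is 4N ln 2, or about 2.77N.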


PARALLELISM  BETWEEN  RANDOM  DRIFT  AND  INBREEDING 

Consider a set of four subpopulations each started with allele frequency p =
1/2, and each undergoing random drift independently following binomial
sampling (Figure 7.9). Within any particular subpopulation (call it subpopulation


Figure 7.9 A schematic diagram showing a set of four populations
undergoing the process of drift (panels, left to right: initial; after 1.39N generations;
after fixation). Initially the allele frequency is 1/2 in all four populations,
and the average heterozygosity is 1/2. As the populations drift in allele frequency,
the average is expected to remain the same (indicated by p remaining 1/2) but the
average heterozygosity decreases. (Genotype frequencies are given for the
intermediate generation.) Finally, all populations go to fixation, half fix one allele
and half fix the other, so the average allele frequency is still 1/2, but the
heterozygosity is zero.

number i), mating is random because all the assumptions in Table 7.1 hold
true. If the allele frequencies of A and a in the ith subpopulation are denoted p_i
and q_i, then the genotype frequencies of AA, Aa, and aa are given by the
familiar Hardy-Weinberg principle as p_i^2, 2p_iq_i, and q_i^2. Furthermore, picture the
situation in Figure 7.9 at a time so advanced that all subpopulations are fixed for
one allele or the other. Within the ith subpopulation, therefore, either p_i equals
0 or p_i equals 1. The genotype frequencies of AA, Aa, and aa in that
subpopulation are either 0, 0, and 1 (if p_i = 0), or 1, 0, and 0 (if p_i = 1). These genotype
frequencies, though extreme, still satisfy the Hardy-Weinberg principle. Thus,
within any one subpopulation in Figure 7.9, the frequency of heterozygotes is
that expected with random mating.

The situation regarding the total population in Figure 7.9 is very different,
however, as there is an overall deficiency of heterozygotes. Suppose that we
sample the four subpopulations, but that we are unaware of the existence of
the four subpopulations, and instead we think that the sample contains a
single randomly mating population. Considering the four populations at the
right side of Figure 7.9 (after all variation is lost), if we were to calculate the
allele frequency we would obtain p = 1/2. We would then naively expect a
fraction 2pq = 1/2 of the genotypes to be heterozygous. In fact, we would have no
heterozygotes at all in our sample! This rather paradoxical result (that there
is a deficiency of heterozygotes in the total population even though random
mating occurs within each subpopulation) is a consequence of the random
genetic drift of allele frequencies among subpopulations due to their finite
size. This extreme case when each subpopulation is fixed is easy to
understand: a population with allele frequency 1/2 could only be made up of two
subpopulations each fixed for A, and two subpopulations each fixed for a.
The entire population has no heterozygotes whatsoever, but the average
allele frequency is 1/2. The total population has a deficiency of heterozygotes,
much as if there were inbreeding. This inbreeding-like effect of population
subdivision is known as the Wahlund principle (Chapter 4), and we are now


TABLE 7.1 ASSUMPTIONS OF MODEL OF RANDOM GENETIC DRIFT


(1)  Diploid  organism 

(2)  Sexual  reproduction 

(3)  Nonoverlapping  generations 

(4)  Many  independent  subpopulations,  each  of  constant  size  N 

(5)  Random  mating  within  each  subpopulation 

(6)  No  migration  between  subpopulations 

(7)  No  mutation 

(8)  No  selection 

Figure 7.10 Diagram illustrating the reasoning behind the recursion for F in a
finite population. When the gametes are drawn to make up the population at
generation t, there is a chance 1/2N that any pair of alleles will be copies of the
same allele in generation t - 1. If this happens, the probability of identity is 1.
For the allele pairs drawn in generation t from two distinct alleles at generation
t - 1 (the probability of this is 1 - 1/2N), the probability of identity is F_{t-1}.
Adding the probabilities of these two events, we get F_t = 1/2N + (1 - 1/2N)F_{t-1}.


in  a position  to  quantify  the  manner  in  which  subpopulations  diverge  in 
allele  frequency  under  random  genetic  drift. 

In Chapter 4 we measured the extent of inbreeding with the inbreeding
coefficient, F. F is the probability of autozygosity, or the probability that an
individual carries a pair of alleles that are identical by descent (derived from
a common ancestor). Even though random mating occurs within each
subpopulation in Figure 7.9, because gametes do combine at random, any two
alleles in a subpopulation may be identical by descent due to the limited
population size. Thus F_t does not equal zero. The value of F_t can be calculated as
in Figure 7.10. This figure shows the 2N alleles in a breeding population of
generation t - 1. In sampling alleles for generation t, the first chosen allele
may be any of those present in generation t - 1 with equal chance. The
probability that the second chosen allele is of the same type as the first is 1/2N,
because this is the frequency of each allelic type in the gametic pool; the
probability that the second chosen allele is of a different type from the first is
accordingly 1 - 1/2N. In the first case, the probability of identity-by-descent
is 1; in the second case it is F_{t-1}. Altogether the recursion is


$$F_t = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1} \tag{7.10}$$


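Multiplying both sides by -1 and adding 1 leads to

$$1 - F_t = \left(1 - \frac{1}{2N}\right)\left(1 - F_{t-1}\right)$$

and so

$$1 - F_t = \left(1 - \frac{1}{2N}\right)^{t}\left(1 - F_0\right) \tag{7.11}$$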

Figure 7.11 Increase of F_t in ideal populations as a function of time and
effective population size N.


or, when F_0 = 0,

$$F_t = 1 - \left(1 - \frac{1}{2N}\right)^{t} \tag{7.12}$$


Figure 7.11 shows the rapid increase of F_t in small populations. Another
aspect of the same phenomenon can be appreciated by the probability of
drawing a pair of alleles that are not identical by descent. This probability is
the same as the heterozygosity, and it can be written

$$H_t = 1 - F_t \tag{7.13}$$

By substitution for F_t we obtain the rate of change in heterozygosity from
random genetic drift

$$H_t = \left(1 - \frac{1}{2N}\right)H_{t-1} \tag{7.14}$$

and so

$$H_t = \left(1 - \frac{1}{2N}\right)^{t}H_0 \approx H_0 e^{-t/2N} \tag{7.15}$$

Recall again that a single population undergoing random drift remains
in approximate Hardy-Weinberg proportions, and that the symbol H_t
represents a sort of "virtual heterozygosity" averaged across many
subpopulations. The above equations show that pure random drift should result in
the heterozygosity decreasing at a geometric rate, since H_t is multiplied by
the constant (1 - 1/2N) each generation. Experimental tests of this
prediction are shown in Figure 7.12. Figure 7.12A shows how the heterozygosity
averaged across the populations in Figure 7.4 declines over generations,
but the theoretical curve when N = 16 does not fit the data very well.
In fact, the rate of decline of heterozygosity is greater than the theoretical
expectation, as though the population size were smaller than N = 16. On
the other hand, the allele frequency, averaged across populations, is not
expected to change, and the data agree with this aspect of the theory quite
well (Figure 7.12B).


PROBLEM  7.6  Use  Equation  7.15  to  determine  how  long  it  takes  for 
a finite  population  to  halve  in  heterozygosity. 


ANSWER Set (1/2)H_0 = H_0 e^{-t/2N}. Dividing out the H_0, and taking
logarithms, we obtain

$$\ln\frac{1}{2} = -\frac{t}{2N}$$

so t = 2N ln 2 = 1.39N. In other words, it takes 1.39N generations to halve the
heterozygosity, regardless of its initial value. Fisher expressed this
result by saying that it takes 1.39N generations to halve the genic
variance in the population. Since the variance of a binomial sample is
pq/2N, and the variance in allele frequency among subpopulations is
proportional to the heterozygosity, it follows that both the variance
and the heterozygosity decrease at the same rate.
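
The recursion for F and the decay of heterozygosity can be tabulated directly. The sketch below is illustrative: the function names are hypothetical, N = 16 is chosen to match the experiment in Figure 7.4, and the exact half-life is computed from the geometric form (1 - 1/2N)^t rather than from its exponential approximation.

```python
from math import log


def inbreeding_coefficient(t: int, n: float, f0: float = 0.0) -> float:
    """Equation 7.11 solved for F_t: 1 - F_t = (1 - 1/2N)^t (1 - F_0)."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n)) ** t * (1.0 - f0)


def heterozygosity(t: int, n: float, h0: float) -> float:
    """Equation 7.15 (exact geometric form): H_t = (1 - 1/2N)^t H_0."""
    return h0 * (1.0 - 1.0 / (2.0 * n)) ** t


def half_life_exact(n: float) -> float:
    """Generations for (1 - 1/2N)^t to reach 1/2."""
    return log(0.5) / log(1.0 - 1.0 / (2.0 * n))


def half_life_approx(n: float) -> float:
    """Generations for e^(-t/2N) to reach 1/2, that is 2N ln 2."""
    return 2.0 * n * log(2.0)


n = 16
for t in (0, 5, 10, 19):
    print(f"t = {t:2d}: F_t = {inbreeding_coefficient(t, n):.3f}, "
          f"H_t = {heterozygosity(t, n, 0.5):.3f}")
print(f"half-life: exact {half_life_exact(n):.2f}, approximate {half_life_approx(n):.2f}")
```

The two half-lives differ only slightly even for a population this small, which is why the exponential form in Equation 7.15 is a convenient shortcut.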

Figure 7.12 Theoretical curves for average heterozygosity (A) with N = 9 or
N = 16, along with actual values (plotted as points) from the experiment in
Figure 7.4. In (B) the observed and expected allele frequencies (averaged across the
107 subpopulations) are plotted. The horizontal axis is generation (t). (Data from
Buri 1956.)


Several  important  consequences  of  the  population  structure  in  Figure  7.9 
can  now  be  summarized.  First,  although  each  subpopulation  is  finite  in  size, 
we  can  imagine  so  many  of  them  that  the  size  of  the  total  population  is  effec- 
tively infinite.  For  an  infinite  population  that  obeys  the  assumptions  in  Table 
7.1,  the  allele  frequencies  must  remain  constant.  That  is,  even  though  the 
allele  frequency  in  any  individual  subpopulation  may  change  willy-nilly  due 
to  random  genetic  drift,  the  overall  average  allele  frequency  of  A among  sub- 
populations remains  p0,  where  p0  represents  the  allele  frequency  of  A in  the 
base  population.  Figure  7.12B  gives  an  experimental  demonstration  of  the 
constancy of average allele frequency. Since $F_t$ is the probability of
autozygosity of a gene in an individual in generation t, the probability of
allozygosity (obtaining a pair of alleles that are not identical by descent) is
$1 - F_t$. Because $p_0$ is the overall allele frequency of A, the probability
that a randomly chosen individual will be genotypically AA is $p_0^2(1 - F_t)$
[for the case of allozygosity] + $p_0F_t$ [for the case of autozygosity].
Similarly, the probability that the individual will be Aa equals
$2p_0q_0(1 - F_t)$; and the probability that the individual will be aa equals
$q_0^2(1 - F_t) + q_0F_t$. Note that the genotypic frequencies in the total
population are different from the standard Hardy-Weinberg proportions, because
there is an apparent excess of homozygotes. However, within any one
subpopulation, the genotypic frequencies still obey the Hardy-Weinberg
principle because of random mating. Substituting for $F_t$ in Equation 7.12
implies that the average heterozygosity among subpopulations at time t equals
$2p_0q_0(1 - F_t) = 2p_0q_0(1 - 1/2N)^t$; this is the theoretical
curve plotted in Figure 7.12A (with $p_0 = q_0 = 1/2$).
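
The genotype frequencies in the total population can be tabulated directly. This is a small sketch, assuming that the base population starts with $F_0 = 0$ so that $F_t = 1 - (1 - 1/2N)^t$; the function names are hypothetical.

```python
def autozygosity(N, t):
    """F_t = 1 - (1 - 1/(2N))**t, assuming F_0 = 0 in the base population."""
    return 1 - (1 - 1 / (2 * N)) ** t

def genotype_frequencies(p0, N, t):
    """Frequencies of AA, Aa, aa in the total population at generation t."""
    q0 = 1 - p0
    F = autozygosity(N, t)
    freq_AA = p0 ** 2 * (1 - F) + p0 * F   # allozygous AA + autozygous AA
    freq_Aa = 2 * p0 * q0 * (1 - F)        # heterozygotes must be allozygous
    freq_aa = q0 ** 2 * (1 - F) + q0 * F   # allozygous aa + autozygous aa
    return freq_AA, freq_Aa, freq_aa

# Heterozygosity curve for N = 16 and p0 = 1/2, one of the cases in Figure 7.12A.
for t in (0, 5, 10, 19):
    print(t, genotype_frequencies(0.5, 16, t)[1])
```

At t = 0 the three frequencies reduce to the Hardy-Weinberg proportions, and the heterozygote class shrinks by the factor $(1 - 1/2N)$ each generation.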

Since $F_t$ eventually goes to 1, all subpopulations eventually become
fixed  for  one  allele  or  the  other.  Because  the  average  allele  frequency  of  A 
remains  p0  even  when  all  subpopulations  have  become  fixed,  the  proportion 
of  subpopulations  that  eventually  become  fixed  for  A must  be  p0  (and  the 
proportion  that  eventually  become  fixed  for  a must  be  q0).  Stated  another 
way,  the  probability  of  ultimate  fixation  of  an  allele  in  any  ideal  subpopu- 
lation is  equal  to  the  frequency  of  that  allele  in  the  initial  population.  This 
point is illustrated by the actual example in Figure 7.4, where $p_0 = 1/2$; by
generation 19, a total of 58 populations have become fixed, 30 for the $bw$
allele and 28 for $bw^{75}$.
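
These predictions can be explored with a small simulation. The sketch below assumes ideal Wright-Fisher subpopulations, each re-formed every generation by binomial sampling of 2N alleles. The number of subpopulations (1000) and the random seed are arbitrary illustrative choices, not values from the experiment.

```python
import random

def simulate_subpopulations(p0, N, generations, n_subpops, seed=1):
    """Allele frequencies after drift in independent ideal subpopulations."""
    rng = random.Random(seed)
    two_n = 2 * N
    freqs = [p0] * n_subpops
    for _ in range(generations):
        # Each new generation is a binomial sample of 2N alleles from the old one.
        freqs = [sum(rng.random() < p for _ in range(two_n)) / two_n
                 for p in freqs]
    return freqs

freqs = simulate_subpopulations(p0=0.5, N=16, generations=19, n_subpops=1000)
mean_p = sum(freqs) / len(freqs)                        # should stay near p0
mean_het = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
expected_het = 2 * 0.5 * 0.5 * (1 - 1 / (2 * 16)) ** 19
fixed_for_A = sum(p == 1.0 for p in freqs)
fixed_for_a = sum(p == 0.0 for p in freqs)
print(mean_p, mean_het, expected_het, fixed_for_A, fixed_for_a)
```

With $p_0 = 1/2$ the two fixation counts should be roughly equal, and the average allele frequency should remain close to $p_0$ even though individual subpopulations wander.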


EFFECTIVE  POPULATION  SIZE 

As  we  saw  in  the  Drosophila  experiments  in  Figure  7.12,  populations  general- 
ly fluctuate  in  allele  frequency  by  an  amount  greater  than  pq/2N.  The  reason 
is  that  no  real  population  obeys  all  the  assumptions  in  Table  7.1  exactly.  In 
any  actual  case,  there  must  be  corrections  for  such  complications  as  fluctua- 
tions in  population  size,  unequal  numbers  of  males  and  females,  age  struc- 
ture, and  skewed  distributions  in  family  size  (see  Crow  and  Kimura  1970). 
The  degree  to  which  genetic  drift  can  change  allele  frequencies,  and  the  rates 
of  allele  fixation  by  drift,  can  be  approximated  under  these  complicating  cir- 
cumstances by  calculating  the  effective  size  of  the  population  and  using  this 
value  in  the  theory  for  an  ideal  population.  That  is,  the  effective  population 
size  of  an  actual  population  is  the  number  of  individuals  in  a theoretically 
ideal  population  having  the  same  magnitude  of  random  genetic  drift  as  the 
actual  population.  There  are  three  kinds  of  effective  population  size  based  on 
how  we  choose  to  measure  “magnitude,”  namely:  (1)  the  change  in  average 
inbreeding  coefficient,  (2)  the  change  in  variance  in  allele  frequency,  or  (3)  the 
rate of loss of heterozygosity. These are called the inbreeding effective size, the
variance  effective  size,  and  the  eigenvalue  effective  size,  respectively. 

Wright  (1931)  first  worked  out  the  effective  population  size  by  considering 
the  effective  degree  of  inbreeding  in  various  situations.  As  noted,  the  effective 
population  size  can  also  be  calculated  by  determining  the  rate  of  change  in 
variance  in  a population,  and  Kimura  and  Crow  (1963)  first  applied  this 
approach  to  the  problem  of  overlapping  generations.  Usually,  the  inbreeding 
effective  size  and  the  variance  effective  size  are  the  same,  but  exceptions  do 
occur.  Similarly,  the  variance  effective  size  and  the  eigenvalue  effective  size 
can  be  distinct  (Ewens  1982).  Some  of  the  various  factors  that  require  calcula- 
tion of  an  effective  population  size  will  now  be  illustrated.  We  will  focus  on 
the  inbreeding  effective  size  because  this  concept  is  the  most  widely  used. 


Fluctuation  in  Population  Size 

Correction  for  fluctuating  population  size  is  important  because  natural  pop- 
ulations actually  do  change  in  size,  sometimes  by  a factor  of  10  or  more  in  a 
single  generation.  For  the  sake  of  simplicity,  assume  that  the  population  is 
ideal  in  all  respects  except  that  its  size  is  not  constant.  We  will  consider  the 
situation  over  just  two  generations.  Suppose  that  the  population  sizes  in  two 
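successive generations are $N_0$ and $N_1$. The arguments laid out in Figure 7.10
imply that

$$1 - F_2 = \left(1 - \frac{1}{2N_1}\right)(1 - F_1) \tag{7.16}$$

and

$$1 - F_1 = \left(1 - \frac{1}{2N_0}\right)(1 - F_0) \tag{7.17}$$

Substituting from the second equation into the first leads to

$$1 - F_2 = \left(1 - \frac{1}{2N_1}\right)\left(1 - \frac{1}{2N_0}\right)(1 - F_0) \tag{7.18}$$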


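By analogy with the constant N case, it is appropriate to try to express this
equation in the general form

$$1 - F_t = \left(1 - \frac{1}{2N}\right)^t(1 - F_0) \tag{7.19}$$

where N is now the effective population size. In our example t = 2, so

$$1 - F_2 = \left(1 - \frac{1}{2N}\right)^2(1 - F_0) \tag{7.20}$$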
Setting the two expressions for $1 - F_2$ equal to each other we obtain

$$\left(1 - \frac{1}{2N}\right)^2 = \left(1 - \frac{1}{2N_1}\right)\left(1 - \frac{1}{2N_0}\right) \tag{7.21}$$

from which $1/N = \tfrac{1}{2}(1/N_0 + 1/N_1)$ turns out to be an excellent
approximation. In general,

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \cdots + \frac{1}{N_{t-1}}\right) \tag{7.22}$$


and  so  the  effective  size  Ne  is  the  harmonic  mean  of  the  actual  numbers — 
the  reciprocal  of  the  average  of  reciprocals.  As  illustrated  in  the  problem 
below,  the  harmonic  mean  tends  to  be  dominated  by  the  smallest  terms.  In 
biological  reality,  this  means  that  a single  period  of  small  population  size, 
called  a bottleneck,  can  result  in  a serious  loss  in  heterozygosity.  Popula- 
tion bottlenecks  are  thought  to  account  for  the  very  low  levels  of  polymor- 
phism found  in  extant  populations  of  the  elephant  seal  (Bonnell  and 
Selander  1974)  and  the  cheetah  (O'Brien  et  al.  1985, 1987).  A severe  popula- 
tion bottleneck  often  occurs  in  nature  when  a small  group  of  emigrants 
from  an  established  subpopulation  founds  a new  subpopulation;  the 
accompanying  random  genetic  drift  is  then  known  as  a founder  effect  (see 
Holgate  1966;  Nei  et  al.  1975;  Chakraborty  and  Nei  1977;  Neel  and  Thomp- 
son 1978).  Founder  effects  in  human  populations  have  implications  in  med- 
ical genetics,  because  human  populations  derived  from  small  numbers  of 
founders may have an elevated incidence of an otherwise rare genetic disorder.
Examples include Tay-Sachs disease in Ashkenazi Jews, diastrophic
dystrophy in Finns, familial hyperchylomicronemia in Quebecois, and congenital
total color blindness in Pingelap Islanders. In addition to reducing
the  effective  size,  and  thereby  increasing  F,  population  bottlenecks  and 
founder  effects  may  affect  many  other  aspects  of  the  genetic  variation, 
including  causing  a reduced  number  of  alleles,  a distorted  distribution  of 
numbers  of  molecular  site  differences  among  alleles,  and  an  increased  level 
of  linkage  disequilibrium. 


PROBLEM 7.7 Suppose a population went through a bottleneck as
follows: $N_0 = 1000$, $N_1 = 10$, and $N_2 = 1000$. Calculate the effective size
of this population across all three generations.

ANSWER Using Equation 7.22, we get 1/N = (1/3)(1/1000 + 1/10
+ 1/1000) = 0.034, or N = 1/0.034 = 29.4. The average effective number
over the three-generation period is only 29.4, whereas the arithmetic
average number of individuals is (1/3)(1000 + 10 + 1000) = 670.
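
The harmonic-mean calculation is easy to reproduce. The sketch below also solves the exact multi-generation analogue of Equation 7.21 for comparison; that exact solution is an extension written here for illustration (it is not worked in the text), and both function names are hypothetical.

```python
def harmonic_effective_size(sizes):
    """Harmonic mean of per-generation sizes (Equation 7.22)."""
    return len(sizes) / sum(1 / n for n in sizes)

def exact_effective_size(sizes):
    """Solve (1 - 1/(2N))**t = product of (1 - 1/(2N_i)) for N."""
    product = 1.0
    for n in sizes:
        product *= 1 - 1 / (2 * n)
    per_generation = product ** (1 / len(sizes))
    return 1 / (2 * (1 - per_generation))

sizes = [1000, 10, 1000]
print(round(harmonic_effective_size(sizes), 1))   # 29.4, as in Problem 7.7
print(exact_effective_size(sizes))                # close to the harmonic mean
print(sum(sizes) / len(sizes))                    # arithmetic mean, 670.0
```

The single generation of size 10 dominates both effective sizes, while the arithmetic mean barely registers it.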


Unequal  Sex  Ratio,  Sex  Chromosomes,  Organelle  Genes 

A second  important  case  in  which  the  effective  size  of  a nonideal  population 
can  readily  be  calculated  concerns  sexual  populations  in  which  the  number 
of  males  and  females  is  unequal.  This  inequality  creates  a peculiar  sort  of 
"bottleneck"; because half of the alleles in any generation must come from
each  sex,  any  departure  of  the  sex  ratio  from  equality  will  enhance  the  oppor- 
tunity for  random  genetic  drift.  This  situation  is  important  in  wildlife  man- 
agement, where,  for  many  game  animals  (pheasants  and  deer  come 
immediately  to  mind),  the  legal  bag  limit  for  males  is  much  larger  than  for 
females.  Although  some  management  goals  are  served  by  such  hunting  reg- 
ulations (for  example,  the  species  involved  are  usually  polygamous,  so  one 
male  can  fertilize  many  females  and  overall  actual  population  size  can  be 
maintained),  it  must  be  remembered  that  the  resultant  inequality  in  sex  ratio 
reduces the effective population size. Specifically, if a sexual population
consists of $N_m$ males and $N_f$ females, the actual size is

$$N_a = N_m + N_f \tag{7.23}$$

However, the effective population size is

$$N_e = \frac{4N_mN_f}{N_m + N_f} \tag{7.24}$$


Figure  7.13  shows  the  relationship  between  sex  ratio  and  the  reduction  in 
effective  population  size.  To  take  a realistic  example,  if  hunting  is  permitted 
to  a level  at  which  the  number  of  surviving  males  is  one-tenth  the  number  of 
females,  then  the  effective  population  size  is  a mere  one-third  of  the  actual 
number  of  individuals  in  the  population. 
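
The one-third figure follows directly from Equation 7.24. A minimal sketch, with a hypothetical function name and arbitrary illustrative head counts (100 males, 1000 females):

```python
def sex_ratio_effective_size(n_males, n_females):
    """Effective size with unequal sexes: Ne = 4*Nm*Nf / (Nm + Nf)."""
    return 4 * n_males * n_females / (n_males + n_females)

n_males, n_females = 100, 1000          # males are one-tenth of females
actual = n_males + n_females
effective = sex_ratio_effective_size(n_males, n_females)
print(effective / actual)               # roughly one-third

# With an equal sex ratio the effective size equals the actual size.
print(sex_ratio_effective_size(550, 550))
```

Only the ratio of the sexes matters for the proportional reduction; multiplying both counts by the same factor leaves `effective / actual` unchanged.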

A related problem is the effective population size for an X-linked gene. In
this case, the variance effective population size is

$$N_e = \frac{9N_mN_f}{4N_m + 2N_f} \tag{7.25}$$


Equation 7.25 can be justified by noting that the sampling variance for the
X chromosomes from males is $p_mq_m/N_m$, whereas the sampling variance for X
chromosomes from females is $p_fq_f/2N_f$, in which $p_m$ and $p_f$ are the
frequencies of allele A in males and females, respectively. The frequency of an
A-bearing X chromosome in the population is

$$p = \frac{1}{3}p_m + \frac{2}{3}p_f \tag{7.26}$$

and the sampling variance of p is

$$\mathrm{Var}(p) = \frac{1}{9}\left(\frac{p_mq_m}{N_m}\right) + \frac{4}{9}\left(\frac{p_fq_f}{2N_f}\right) \tag{7.27}$$

At steady state, $p_m = p_f = p$, so pq can be factored out, giving

$$\mathrm{Var}(p) = pq\left(\frac{1}{9}\,\frac{1}{N_m} + \frac{4}{9}\,\frac{1}{2N_f}\right) = \frac{pq}{2\left[\dfrac{9N_mN_f}{4N_m + 2N_f}\right]} \tag{7.28}$$

The term in the square brackets corresponds to the $N_e$ in Equation 7.25. It
shows why this is a variance effective size: the binomial sampling variance in
an ideal population is $pq/2[N_e]$.
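
The algebra can be confirmed numerically. The sketch below assumes the X-linked effective size $N_e = 9N_mN_f/(4N_m + 2N_f)$ and checks that the weighted male and female sampling variances equal $pq/2N_e$; the function names and the example numbers are illustrative.

```python
def x_linked_effective_size(n_males, n_females):
    """Variance effective size for an X-linked gene: 9*Nm*Nf / (4*Nm + 2*Nf)."""
    return 9 * n_males * n_females / (4 * n_males + 2 * n_females)

def x_linked_sampling_variance(p, n_males, n_females):
    """Var(p) at steady state: (1/9) pq/Nm + (4/9) pq/(2 Nf)."""
    q = 1 - p
    return (1 / 9) * p * q / n_males + (4 / 9) * p * q / (2 * n_females)

p, n_males, n_females = 0.3, 200, 800
ne = x_linked_effective_size(n_males, n_females)
direct = x_linked_sampling_variance(p, n_males, n_females)
via_ne = p * (1 - p) / (2 * ne)
print(direct, via_ne)                         # the two agree

# With equal numbers of each sex, Ne is three-quarters of the total.
print(x_linked_effective_size(500, 500))      # 750.0
```

The three-quarters value for an equal sex ratio reflects the fact that there are three X chromosomes for every four autosomes of a given kind.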
PROBLEM 7.8 What is the effective population size for mitochondrial
DNA? (Assume transmission is exclusively from mothers to all
offspring.) What is the effective population size for a gene on the Y
chromosome, given that the population consists of N diploid individuals
and all other assumptions of Table 7.1 apply? (Assume XX individuals
are female and XY individuals are male.)


ANSWER  Mitochondrial  DNA  is  transmitted  essentially  exclusively 
by  females.  The  chance  of  drawing  two  mtDNAs  that  are  identical  by 
descent is $1/N_f$, where $N_f$ is the number of females in the population.
Hence the effective size is simply $N_f$. Similarly, the effective population
size for the Y chromosome is $N_m$, the number of males in the population.
Note that even though mtDNA is present in all individuals,
while  the  Y is  present  only  in  males,  the  effective  size  of  mtDNA  is  not 
larger.  Effective  size  depends  on  the  sampling  properties  of  a gene, 
which  depends  on  the  gene's  transmission,  not  just  on  how  many 
individuals  carry  the  gene. 


BALANCE  BETWEEN  MUTATION  AND  DRIFT 

There  are  many  forces  in  population  genetics  that  act  in  opposition  to  one 
another,  and  it  is  this  tension  that  makes  for  interesting  behavior  at  the  pop- 
ulation level.  Mutation  always  increases  the  amount  of  genetic  variation  in  a 
population.  Random  genetic  drift  results  in  the  loss  of  genetic  variation. 
Merely  because  these  two  forces  are  in  opposition,  it  does  not  guarantee  that 
there  will  be  a stable  balance  between  them.  In  order  to  formally  ask  whether 
the  two  forces  do  balance,  we  need  to  be  careful  to  specify  assumptions  about 
the  processes  of  mutation  and  drift.  We  already  examined  one  such  model  in 
Chapter  5 — the  infinite-alleles  model — and  we  saw  that  in  this  case  the  forces 
do  in  fact  balance  to  provide  an  equilibrium  level  of  neutral  variation.  Let's 
consider  this  model  once  again,  in  somewhat  more  detail. 

INFINITE  ALLELES  MODEL 

As  we  saw  in  Chapter  5,  the  infinite  alleles  model  starts  with  the  assumption 
that  each  mutation  produces  a novel  allele,  never  before  present  in  the  popu- 
lation. Mutations  occur  such  that  each  gene  in  the  population  has  an  equal, 
but  low,  chance  of  mutating.  Random  genetic  drift  occurs  in  the  manner  of 
the Wright-Fisher model: each generation the population is reconstituted by
drawing a sample with replacement from the current sample of alleles.
Under these assumptions we saw that the equilibrium probability of identity,
F, could be approximated as

$$F = \frac{1}{4N\mu + 1} \tag{7.29}$$

The number of selectively neutral alleles increases under mutation pressure
until F satisfies this equation.


PROBLEM  7.9  Derive  an  expression  for  F in  a finite  population  with 
mutation  and  migration. 


ANSWER First assume that there is no mutation, and that new
migrant alleles arrive from another population at a rate m per generation.
As in the balance between mutation and drift in the infinite-alleles
model, we note that alleles can be identical by descent by being
drawn twice (with probability 1/2N) or by having two different alleles
drawn but having them be IBD from the previous generation. The
equilibrium autozygosity can be written

$$F = \left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F\right](1 - m)^2$$

because $(1 - m)^2$ is the probability that neither of the two randomly
chosen alleles comes from a migrant. By analogy with the infinite-alleles
model, we get (in the case of migration with no mutation)

$$F = \frac{1}{4Nm + 1}$$

When both migration and mutation are occurring, alleles are identical
by descent only if they neither mutated nor migrated, and this occurs
with probability $1 - m - \mu$. Thus, the equilibrium autozygosity is

$$F = \frac{1}{4N(m + \mu) + 1}$$
F is the probability of autozygosity, and H, which can be thought of as
heterozygosity, is also the probability of drawing a pair of alleles that are not
autozygous. Since H = 1 - F, under the infinite-alleles model, the equilibrium
of H is

$$H = \frac{4N\mu}{1 + 4N\mu} = \frac{\theta}{\theta + 1} \tag{7.30}$$

where $\theta = 4N\mu$.

The relationship between the quantity $4N\mu$ and H was encountered in
Chapter 5 and is plotted in Figure 5.7. For a per-locus mutation rate of $10^{-6}$
and a population size of 250,000, we get $\theta = 1$, and so H = 1/2. Note that
increases  in  population  size  have  precisely  the  same  effect  as  increases  in 
mutation  rate.  Heterozygosity  approaches  one  only  if  population  sizes  are 
very  large  (such  as  in  microbial  organisms)  or  if  mutation  rates  are  very  high 
(such  as  at  some  microsatellite  loci).  Next  we  will  consider  how  one  might  go 
about  testing  whether  a sample  from  a population  exhibits  a pattern  of  genet- 
ic variation  that  is  compatible  with  the  infinite-alleles  model. 
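
The numerical example above can be reproduced in a few lines. This sketch assumes the steady-state relations $F = 1/(1 + \theta)$ and $H = \theta/(1 + \theta)$ with $\theta = 4N\mu$; the function names are illustrative.

```python
def theta(N, mu):
    """Scaled mutation rate: theta = 4 * N * mu."""
    return 4 * N * mu

def equilibrium_identity(th):
    """Steady-state probability of identity: F = 1 / (1 + theta)."""
    return 1 / (1 + th)

def equilibrium_heterozygosity(th):
    """Steady-state heterozygosity: H = theta / (1 + theta) = 1 - F."""
    return th / (1 + th)

th = theta(N=250_000, mu=1e-6)
print(th, equilibrium_heterozygosity(th))     # theta near 1, H near 1/2

# Doubling N has the same effect as doubling the mutation rate.
print(equilibrium_heterozygosity(theta(500_000, 1e-6)),
      equilibrium_heterozygosity(theta(250_000, 2e-6)))
```

Because N and the mutation rate enter only through their product, the two values printed on the last line coincide.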

The  Ewens  Sampling  Formula 

The infinite-alleles model has an "equilibrium" when $H = 4N\mu/(1 + 4N\mu)$.
This is not an equilibrium in the usual sense. In reality, allele frequencies are
always changing: new mutations continue to come into the population, and
eventually they are eliminated, perhaps even after becoming fixed for some
time. The term steady state is probably more appropriate for this kind of
behavior,  since  the  alleles  are  not  maintained  at  a constant  frequency,  but 
rather  new  ones  are  entering  and  old  ones  are  leaving  the  population.  The 
population  remains  at  a steady  state  in  the  sense  that  the  number  of  alleles, 
and  the  level  of  autozygosity,  remain  stationary.  If  the  number  of  alleles  and 
the  level  of  autozygosity  remain  steady,  then  it  is  reasonable  to  assume  that 
there is also a steady-state distribution of allele frequencies. By steady-state
frequencies we mean that the most common allele always has a frequency of
$p_1$, the next most common has a frequency of $p_2$, and so on. The
steady-state distribution has the curious property that, even though the most
common allele is expected to have a frequency of $p_1$, the identity of the most
common allele is expected to change with time. In the steady-state population,
not all alleles are equally frequent, and F is greater than it would be
were all alleles equally frequent.

Consider  the  steady-state  distribution  of  allele  frequencies  from  the  point 
of  view  of  an  experimenter  taking  a sample  from  a population.  Let  the  sam- 
ple size  be  n genes,  and  suppose  there  are  k different  alleles  in  this  sample. 
The  sample  might  consist  of,  for  example,  10  unique  alleles,  3 alleles  that  are 
represented  twice  in  the  sample,  7 alleles  that  are  present  3 times,  and  so  on. 
Such a description of the sample is called the allelic configuration or
partition. A remarkable finding of Ewens was that the expected configuration of a
sample drawn from a population obeying the infinite-alleles model is entirely
determined by the sample size, n, and the number of observed alleles, k.
Ewens showed that the expected number of alleles in the sample, given $\theta$ and
the sample size, is

$$E(k) = 1 + \frac{\theta}{\theta + 1} + \frac{\theta}{\theta + 2} + \cdots + \frac{\theta}{\theta + n - 1} \tag{7.31}$$


If $\theta$ is very small, $E(k) \approx 1$, whereas for very large $\theta$, E(k) approaches n,
implying that for a large enough population with a high enough mutation
rate, every allele that is sampled will be different. The form of Equation 7.31
suggests that, as the sample size increases, more alleles will be found, but
that there is a diminishing return in finding new alleles when the sample size
increases. When E(k) is plotted against $\theta$ (Figure 7.14), the increase in the
expected number of alleles is greatest for larger sample sizes when the
population is highly diverse (large $\theta$).
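
The behavior of the expected number of alleles can be tabulated directly. A minimal sketch, assuming $E(k) = \sum_{i=0}^{n-1} \theta/(\theta + i)$ (the i = 0 term supplies the leading 1); the function name and the sample sizes are illustrative.

```python
def expected_num_alleles(theta, n):
    """Expected number of distinct alleles in a sample of n genes."""
    return sum(theta / (theta + i) for i in range(n))

# Limiting cases: nearly one allele for tiny theta, nearly n for huge theta.
print(expected_num_alleles(1e-9, 50), expected_num_alleles(1e9, 50))

# Diminishing returns: each doubling of n adds fewer new alleles per gene sampled.
for n in (10, 20, 40, 80):
    print(n, round(expected_num_alleles(1.0, n), 2))
```

Each additional gene sampled contributes $\theta/(\theta + n)$ expected new alleles, a quantity that shrinks as n grows, which is the diminishing return described above.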

The infinite-alleles model gives a steady-state prediction of F given $\theta$
(because $F = 1/(1 + \theta)$ from Equation 7.30), and a prediction of k from
Equation 7.31. Combining these predictions, the expected relation between F and
k is plotted in Figure 7.15. The hyperbolic relation is not surprising, because a
population with many alleles will generally have a lower probability of
identity of a randomly chosen pair of alleles. For $\theta = 1$, the expected F is 1/2 for all


Figure 7.14 Relations between $\theta$, the expected number of alleles, and the
sample size according to the Ewens-Watterson sampling theory of a population in
steady state under the infinite-alleles model of neutral mutation.


Figure 7.15 The infinite-alleles model prediction of the relation between the
expected number of alleles and the expected gene identity F. The three curves
represent a range of values of $\theta = 4N\mu$, starting at $\theta = 0.1$ in the upper left, and
ending with $\theta = 10$ in the lower right. For the value of $\theta = 1$, the expected F,
given by the relation $F = 1/(1 + \theta)$, is 1/2, regardless of the sample size. Larger
sample sizes always lead to larger expected numbers of alleles, but the
difference is greater in more diverse populations (those with smaller F).


sample  sizes,  but  a larger  sample  size  should  yield  a greater  number  of  dis- 
tinct alleles. 

The  Ewens-Watterson  Test 

The  Ewens  sampling  theory  expressed  in  Equation  7.31  shows  that  the  sam- 
ple size  and  the  number  of  distinct  alleles  observed  in  the  sample  are  suffi- 
cient to  give  an  expected  configuration  of  allele  counts.  From  the  observed 
and  expected  configurations,  a number  of  test  statistics  can  be  devised  to 
determine  whether  the  observed  sample  fits  the  expected  values  of  the 
model.  Figure  7.16  shows  histograms  of  the  observed  and  expected  allele  fre- 
quency configurations  for  alleles  in  a human  population  defined  by  a VNTR 
polymorphism.  In  this  particular  example,  there  appears  to  be  a slight  excess 
of  the  common  allele,  which  is  consistent  with  any  number  of  causes  of 
departure  from  the  infinite-alleles  model. 

Keith  et  al.  (1985)  isolated  89  homozygous  lines  from  a sample  of 
Drosophila  pseudoobscura  collected  at  the  Gundlach-Bundschu  Winery  in 
Sonoma  Valley,  California.  Homogenized  tissue  from  these  89  lines  was  then 

Figure 7.16 Observed (open columns) and expected (black bars) allele
frequency distribution of the HRAS-1 locus in humans, identified by Southern
blotting with the pLM0.8 probe and TaqI digests. Observed data are from Baird
et al. (1986), and the expected distribution was generated using Ewens'
sampling theory. In this sample of 490 genes there were 14 distinct alleles, four of
which were present in just one individual. (From Clark 1988.)


subjected  to  sequential  electrophoresis  (a  sensitive  means  of  detecting  charge 
and  conformation  differences  among  the  protein  products),  and  stained  to 
reveal differences in xanthine dehydrogenase (Xdh) mobility. They obtained
a common allele that was present in 52 of the lines, one allele that was
present in nine lines, one allele that was present in eight lines, two alleles present
in  four  lines  each,  two  alleles  that  were  present  in  two  lines  each,  and  eight 
singleton  or  unique  alleles. 

To  test  whether  the  observed  configuration  fits  the  expectation,  a comput- 
er simulation  was  run  to  generate  realizations  of  samples  from  populations 
that  obey  the  infinite-alleles  model,  having  the  same  number  of  alleles  and 
sample  size  as  the  observed  data.  The  algorithm  to  do  this  simulation  is 
described  by  F.  Stewart  in  the  Appendix  to  Fuerst  et  al.  (1977),  and  a listing  of 
a program  can  be  found  in  Manly  (1985).  From  each  computer-generated  sam- 
ple, F is  calculated  as  the  sum  of  the  squared  allele  frequencies.  Figure  7.17 
shows  a histogram  of  the  computer-generated  distribution  of  F,  along  with 
an  arrow  showing  where  the  Drosophila  sample  fell.  The  sample  had  an 
observed  F that  fell  in  the  upper  tail  of  the  distribution,  and  since  so  few  val- 
ues of  F from  the  null  hypothesis  were  larger  than  the  observed  F,  Keith  et  al. 
rejected  the  null  hypothesis  and  argued  that  the  data  did  not  fit  the  infinite- 
alleles  model  satisfactorily.  The  departure  was  in  the  direction  of  excess 
homozygosity  (deficit  of  H),  but  since  the  populations  were  probably  in 
Hardy- Weinberg  proportions,  a clearer  way  to  state  the  result  would  be  to  say 
that  there  was  a deficit  of  genetic  diversity  for  the  given  number  of  observed 
alleles.  The  deficit  means  that  the  common  allele  is  more  common  than 
Figure 7.17 Computer-generated distribution of F obtained from 1000
samples from a population obeying the assumptions of the infinite-alleles model
with k = 15 alleles and a sample of size n = 89 (as in the Xdh data from a sample
of Drosophila pseudoobscura from the Gundlach-Bundschu Winery studied by
Keith et al. 1985). The mean of F from the simulation was 0.168, which is well
below the observed F of 0.366. A significant departure of the observed F from
the predictions of the model is noted by the small area under the tail of the
distribution to the right of the arrow.


expected,  and  there  are  also  more  singletons  than  expected.  This  pattern  of  fre- 
quencies is  consistent  with  purifying  selection  acting  to  eliminate  the  rare, 
slightly  deleterious  alleles  that  continually  enter  the  population  by  mutation.  It 
is  also  consistent  with  an  historical  effect  in  which  many  alleles  may  have  been 
previously  lost  and  the  population  has  not  yet  had  time  to  return  to  equilibrium. 
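
The observed statistic in this test is simple to recompute from the allele counts reported for Xdh. The sketch below covers only the observed side of the test: n, k, and F as the sum of squared allele frequencies. It does not implement Stewart's simulation algorithm for the null distribution, and the variable names are illustrative.

```python
# Allele counts among the 89 lines: one allele in 52 lines, one in nine,
# one in eight, two in four lines each, two in two lines each, eight singletons.
counts = [52, 9, 8, 4, 4, 2, 2] + [1] * 8

n = sum(counts)                                 # sample size
k = len(counts)                                 # number of distinct alleles
F_observed = sum((c / n) ** 2 for c in counts)  # sum of squared frequencies

print(n, k, round(F_observed, 3))               # 89 15 0.366

# For comparison, the smallest F possible with k alleles (all equally frequent).
print(1 / k)
```

The observed value of 0.366 is driven almost entirely by the common allele, whose squared frequency alone contributes about 0.34.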

The  results  of  the  Ewens-Watterson  test  can  also  be  reported  graphically 
as  in  Figure  7.18.  Each  gene  yields  a point  specified  by  the  number  of  distinct 
alleles  and  the  observed  F.  The  two  curves  represent  the  95%  confidence 
interval  generated  by  the  Ewens  sampling  theory.  A quick  check  of  the  con- 
cordance of  the  data  with  the  model  can  be  made  by  seeing  whether  points 
remain  in  this  confidence  region.  Although  Xdh  in  Drosophila  pseudoobscura 
provides  a dramatic  departure  from  the  infinite-alleles  model,  results  like 
those  plotted  in  Figure  7.18,  which  show  an  acceptable  fit  to  neutrality,  are 
more  commonly  obtained. 

INFINITE-SITES  MODEL 

Rather  than  considering  each  mutation  as  generating  a unique  allele,  with 
infinitely  many  possible  alleles,  we  can  instead  consider  an  allele  as  a 
sequence  of  nucleotides  with  mutation  altering  a site  in  the  sequence.  If  the 
mutation  rate  is  sufficiently  low,  then  most  sites  will  be  monomorphic,  and 
all  polymorphic  sites  will  be  segregating  for  just  two  nucleotides.  Much  of 
Figure 7.18 Gene identity (F) plotted against the observed number of alleles
in  a sample  of  279  E.  coli.  The  solid  lines  represent  the  upper  97.5%  and  lower 
2.5%  confidence  limits,  and  the  observation  that  all  of  the  tested  loci  fall  within 
these  limits  suggests  good  concordance  with  the  infinite-alleles  model  of  neutral 
mutation.  (From  Whittam  et  al.  1983.) 


the  available  data  on  allelic  variation  in  DNA  sequence  seems  consistent  with 
this  view:  few  nucleotide  sites  are  segregating  for  more  than  two  nucleotides. 
If  the  DNA  sequence  is  sufficiently  long  and  the  frequency  of  polymorphic 
sites  low,  then  most  of  the  time  new  mutations  will  occur  at  sites  that  were 
previously  monomorphic.  The  infinite-sites  model,  based  on  these  assump- 
tions, was  developed  by  Kimura  (1969, 1971),  who  considered  nucleotides  as 
unlinked,  and  by  Watterson  (1975),  who  took  account  of  the  nearly  complete 
linkage  among  sites. 

The  infinite-sites  model  is  appealing  because  it  directly  addresses  the  type 
of  data  that  molecular  population  geneticists  can  collect.  Given  an  array  of 
DNA  sequences  of  alleles  randomly  sampled  from  a population,  there  is  con- 
siderable information  about  the  history  of  the  alleles  hidden  in  the  patterns  of 
similarity  across  alleles.  The  infinite-alleles  model  ignores  this  pattern  and 
simply  considers  the  alleles  as  distinct.  A much  more  powerful  treatment  is 
to  tabulate  the  number  of  sites  at  which  all  pairwise  combinations  of 
sequences differ, resulting in a so-called mismatch distribution. The
infinite-sites model addresses the theoretically expected behavior of the mismatch
distribution. Watterson (1975) considered the distribution of $S_i$, defined as the
number of segregating sites in a sample of i genes. For the case of a random
sample of two genes, Watterson showed that the steady-state probability that
the sequences have i mismatches is

$$\Pr(S_2 = i) = \frac{1}{\theta + 1}\left(\frac{\theta}{\theta + 1}\right)^i \tag{7.32}$$

where $\theta = 4N\mu$, and $\mu$ is the mutation rate per gene (not per site). A particular
case of this equation gives the probability that two sequences have no sites
different, and hence are identical. Substituting i = 0 into Equation 7.32, we get

$$\Pr(S_2 = 0) = \frac{1}{\theta + 1} \tag{7.33}$$

in agreement with the infinite-alleles model, because $\Pr(S_2 = 0) = F$, the
probability that two alleles drawn at random are identical. The mean and variance in
the distribution of number of segregating sites are $\theta$ and $\theta + \theta^2$, respectively.
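
The stated mean and variance can be checked against the two-gene mismatch distribution. The sketch below assumes the geometric form $\Pr(S_2 = i) = \frac{1}{\theta + 1}\left(\frac{\theta}{\theta + 1}\right)^i$, which is consistent with Equation 7.33 and with the moments quoted above; the function name, the value $\theta = 2$, and the truncation of the sum at 2000 terms are illustrative choices.

```python
def mismatch_probability(i, theta):
    """Probability that two sampled sequences differ at exactly i sites."""
    return (1 / (theta + 1)) * (theta / (theta + 1)) ** i

theta = 2.0
probs = [mismatch_probability(i, theta) for i in range(2000)]

total = sum(probs)
mean = sum(i * p for i, p in enumerate(probs))
variance = sum((i - mean) ** 2 * p for i, p in enumerate(probs))

print(total)       # close to 1
print(mean)        # close to theta
print(variance)    # close to theta + theta**2
```

The i = 0 term equals $1/(\theta + 1)$, the same probability of identity that the infinite-alleles model predicts.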

In  reality  we  do  not  sample  an  entire  population,  so  it  is  important  to  deter- 
mine the  statistical  properties  of  a smaller  sample  drawn  from  a population. 
Often  the  sampling  properties  of  population  genetic  models  are  very  complex 
and  we  have  to  resort  to  simulations  for  meaningful  estimates.  A few  results 
have  been  obtained  for  samples  drawn  from  a population  obeying  the  infinite- 
sites  model,  and  these  results  are  very  useful  for  testing  goodness  of  fit  to  the 
model.  The  expected  number  of  segregating  sites  in  a sample  of  n alleles  is 

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} \tag{7.34}$$

and the variance in the number of segregating sites is

$$V(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2} \tag{7.35}$$

This  expression  for  the  variance  is  for  the  case  of  no  intragenic  recombi- 
nation. It  turns  out  that  intragenic  recombination  does  not  affect  E(S),  but  it 
reduces  V(S).  This  is  not  hard  to  see  intuitively — recombination  shuffles  the 
variation  among  alleles,  reducing  the  average  number  of  sites  by  which  ran- 
dom pairs  of  alleles  differ.  The  expression  for  the  variance  in  the  number  of 
mismatching  sites  in  the  case  of  free  recombination  across  sites  is 

$$V = \frac{n + 1}{3(n - 1)}\,\theta + \frac{2(n^2 + n + 3)}{9n(n - 1)}\,\theta^2 \tag{7.36}$$
Figure 7.19 Equilibrium distribution of the number of mismatches between a
pair  of  alleles.  Note  that  if  there  is  free  recombination,  the  variance  is  smaller 
compared  to  the  case  of  no  recombination. 

Figure 7.19 shows the mismatch distributions for a simulated set of data
with free recombination (smaller variance) and with no recombination (larger
variance). The relationship between the mean and the variance in the mismatch
distribution can be used to make inferences about intragenic recombination
(Hudson 1987).
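As a rough numerical companion to Equations 7.34 and 7.35, the following sketch evaluates the mean and variance of S for a given θ and sample size n in the no-recombination case. The function names and the example values θ = 10, n = 10 are illustrative assumptions, not part of the text.

```python
def harmonic_sums(n):
    """Return (a1, a2): the sums of 1/i and 1/i^2 for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def segregating_sites_moments(theta, n):
    """Mean (Eq. 7.34) and variance (Eq. 7.35, no recombination) of S."""
    a1, a2 = harmonic_sums(n)
    mean_s = theta * a1
    var_s = theta * a1 + theta ** 2 * a2
    return mean_s, var_s


# Illustrative values only: theta = 10 for the whole gene, sample of 10 alleles.
mean_s, var_s = segregating_sites_moments(theta=10.0, n=10)
print(f"E(S) = {mean_s:.3f}, V(S) = {var_s:.3f}")
```

For n = 2 the two sums both equal 1, so the function returns θ and θ + θ², the pairwise mean and variance quoted after Equation 7.33.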

The assumptions of the infinite-alleles model and the infinite-sites model
do not seem to be entirely at odds with one another, and we saw that they
predict the same steady-state value for F. But the two models do make use of
different aspects of the data, and so it would seem that a test of the
consistency between the two models might serve as a useful test for the
neutral theory. The next problem makes use of just this test, which was
devised by Tajima (1989).


PROBLEM 7.10 The average heterozygosity for pairs of randomly
chosen alleles under the infinite-alleles model is E(k) = θ, and the
expected number of sites segregating in a sample (under the
infinite-sites model) is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

Two estimates of θ are therefore k, the average heterozygosity, and

$$\frac{S}{\sum_{i=1}^{n-1} 1/i}$$

Tajima (1989) devised a test statistic to test the null hypothesis that
these two estimates are identical. The test statistic is the difference
between these two estimates of θ, or

$$D = k - \frac{S}{\sum_{i=1}^{n-1} 1/i}$$


If  a population  were  growing  rapidly,  one  might  expect  this  to  affect 
both  the  number  of  segregating  sites  and  the  heterozygosity.  Predict 
the  direction  of  change  of  F (probability  of  identity),  S (the  number  of 
segregating  sites),  and  D (Tajima's  test  statistic). 


ANSWER First consider a larger population at equilibrium. Since
F = 1/(4Nμ + 1), a larger population would have a lower F. A larger
population would also have a larger number of segregating sites (S),
and a higher per-site heterozygosity (k). At equilibrium, if the gene is
neutral, then the Tajima statistic should be zero. In a growing
population, F will decrease as added variation accumulates, S will
increase, and k will increase. The key point is that the increase in
variation will occur in initially rare alleles, which contribute to S but
only a little to k. Thus, S grows faster than k, and D will be negative. If
the population stops growing, then Tajima's D statistic will return to
zero at equilibrium.
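The difference described in Problem 7.10 can be computed directly from aligned sequences. The sketch below is a minimal illustration, assuming equal-length sequences given as strings; the function name and the toy sample are hypothetical. It returns the unnormalized difference between the two estimates of θ, as in the problem; the statistic in common use additionally divides this difference by an estimate of its standard deviation, which is not done here.

```python
from itertools import combinations


def tajima_difference(seqs):
    """Return (k_hat, S, D) where D = k_hat - S / sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    length = len(seqs[0])
    # S: number of sites at which the sample is not monomorphic.
    s = sum(1 for j in range(length) if len({seq[j] for seq in seqs}) > 1)
    # k_hat: average number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    k_hat = total_diffs / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    return k_hat, s, k_hat - s / a1


# Toy sample in which every variant is carried by a single sequence,
# the pattern expected from recently arisen, rare alleles.
sample = ["AAAA", "AAAT", "AATA", "ATAA"]
k_hat, s, d = tajima_difference(sample)
print(k_hat, s, d)
```

In this toy sample every variant is a singleton, so S is large relative to k and the difference is negative, the direction the answer above predicts for a growing population.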


GENE  TREES  AND  THE  COALESCENT 

A sample of genes from a population represents more than a snapshot of
counts of alleles in a population. Each gene that is sampled has an ancestral
history dating back hundreds or thousands of generations. It is possible that
a pair of genes sampled today may have come from identical copies of the
same allele produced by the same individual just a few generations ago. Or
the alleles may have had common ancestry hundreds of generations ago. The
term coalescence refers to this process, looking backward in time, and seeing
how two genes merge at times of common ancestry. Along this process, one
goes from a sample of k genes, to k - 1 ancestors after the first coalescence,
to k - 2 ancestors after the second coalescence, and so forth until there is a
single common ancestor for the whole sample. The idea of the coalescent is to
consider the ancestral history of genes in a sample by developing a model for
the time to common ancestry (Kingman 1980).

To understand how the coalescent process works, consider in Figure 7.20
what happens as time moves forward. In each generation there are a number
of alleles in the population, and those alleles may be reproduced and be
present in the following generation (moving down the figure), or, in some
cases, an allele is not reproduced and is lost from the population. By chance,
some alleles may be sampled twice in constituting the next generation, and the
probabilities of these events are the same as those under the Wright-Fisher
model of random genetic drift. By a repetition of this process over time,
eventually one of the original alleles will become "fixed" in the population.
In the absence of mutation, the population would therefore be fixed for the
same allele; however, because mutation may occur during the process, the
alleles observed at the present will not all be identical in nucleotide
sequence, even though they all descended from a single common ancestral
allele.

Figure 7.20 Diagram showing paths of ancestry of a set of alleles sampled at
the present. The population is represented as having a constant size. Starting
at the top and working down, notice that many alleles go extinct, and one
allele goes to fixation. Considering this process in reverse, the current
sample observed at present undergoes a series of coalescence events in which
the k alleles present in the current generation had only k - 1 ancestors. This
process continues backward in time until there is only one ancestral allele.
(Vertical axis: generations ago, with the present at the bottom.)


In reality we do not have the genealogical information enabling us to
follow all the alleles through time in a population. Typically what we have is
a single "snapshot" represented by a small sample of alleles taken at the
present time. Now consider Figure 7.20 again, but this time look at what
happens when we go backwards in time. We start with the k alleles in the
sample at generation 0. In going from generation 0 to generation 1 (one
generation ago), we see that the two rightmost alleles "coalesced" into a
single ancestral allele. As we go further back in time, the number of
ancestral alleles has to either remain the same or decrease, and each
reduction in the number of ancestral alleles is called a coalescence event. In
order to show how this idea can be extended to derive expressions for the
entire distribution of branch lengths of a gene tree, we next specify a model.

Consider two alleles. The probability that the two alleles came from the
same allele in the previous generation is 1/2N (in a diploid population of
size N), so the chance that they came from two distinct alleles the previous
generation is 1 - 1/2N. The probability that three alleles had three distinct
ancestral alleles the previous generation is Pr(alleles 1 and 2 have distinct
ancestors) x Pr(allele 3 is different from both 1 and 2) =
(1 - 1/2N)(1 - 2/2N). In general, the probability that k alleles had k
distinct parental alleles the previous generation is

$$\Pr(k) = \prod_{i=1}^{k-1} \left(1 - \frac{i}{2N}\right) \approx 1 - \frac{\binom{k}{2}}{2N} \tag{7.37}$$

Each generation the sampling process occurs independently of what happened
before, and so the probability that k alleles had k distinct parental alleles
two generations ago is the square of the right-hand side of Equation 7.37.
Consider two alleles again. Suppose we wish to know the chance that the
common ancestor of these two alleles occurred exactly t generations ago. In
this case there must have been no coalescence (i.e., two distinct ancestral
lineages were found) for t - 1 generations, and then, in the next preceding
generation, a coalescence occurred. The probability of not coalescing for
t - 1 generations is (1 - 1/2N)^(t-1) and the chance of the two alleles
coalescing in any one generation is 1/2N. The desired probability is the
product of these, or

$$\Pr(\text{2 alleles had common ancestor } t \text{ generations ago}) = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{t-1} \approx \frac{1}{2N}\, e^{-t/(2N)} \tag{7.38}$$

The exponential is an approximation that is quite good when 1/2N is small.
This distribution has a mean of 2N generations and a variance of 4N². Note
that the confidence interval around the mean time is not very tight, since the
standard deviation of the distribution is equal to the mean.
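To get a feel for how close the exponential approximation is, the sketch below compares the exact geometric probability with its exponential approximation for a pair of alleles. The exact form used here is (1/2N)(1 - 1/2N)^(t-1), following the verbal argument above (no coalescence for t - 1 generations, then a coalescence); the function names and the choice N = 50 are illustrative assumptions.

```python
import math


def pr_coalesce_exact(t, n_diploid):
    """Geometric probability that two alleles first coalesce t generations ago."""
    p = 1.0 / (2 * n_diploid)
    return p * (1.0 - p) ** (t - 1)


def pr_coalesce_approx(t, n_diploid):
    """Exponential approximation (1/2N) * exp(-t / 2N)."""
    two_n = 2 * n_diploid
    return math.exp(-t / two_n) / two_n


n_diploid = 50  # illustrative population size, so 2N = 100
for t in (1, 50, 100, 200):
    exact = pr_coalesce_exact(t, n_diploid)
    approx = pr_coalesce_approx(t, n_diploid)
    print(t, exact, approx)

# Mean of the exact distribution, summed far enough that the tail is negligible.
mean_t = sum(t * pr_coalesce_exact(t, n_diploid) for t in range(1, 5001))
print("mean coalescence time:", mean_t)
```

The summed mean should come out very close to 2N generations, in line with the statement that the mean equals 2N.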

Returning to our sample of k alleles, the probability that the k alleles do
not coalesce for t generations, and then one pair coalesces to give k - 1
alleles at t + 1 generations ago, is as follows:

$$\Pr(k \text{ ancestors for } t \text{ generations, } k - 1 \text{ ancestors at } t + 1 \text{ generations ago}) = \Pr(k)^t\,[1 - \Pr(k)] \approx \frac{\binom{k}{2}}{2N} \exp\left(-\frac{\binom{k}{2}}{2N}\, t\right) \tag{7.39}$$


This approximation is valid if k << N. The distribution in Equation 7.39 has
a mean of 4N/[k(k - 1)] generations and a variance of 16N²/[k(k - 1)]². Figure
7.21 shows what the gene genealogy is expected to be. Starting with five
alleles, the first coalescence is expected to occur 2N/10 generations ago, the
next at 2N/6 generations prior to that, and so on. Note that the time
intervals get longer and longer as the number of lineages decreases. The
distribution of each of these time intervals is exponential, with
ever-increasing means as one goes back in time.

Figure 7.21 The process of coalescence can be represented by a gene tree. At
each generation, if there are k alleles present, the expected time back to the
next coalescence is $2N/\binom{k}{2}$. Starting with five alleles, the
expected time back to the first coalescence is 2N/10. Note that the successive
times get longer. When there are only two alleles, the time back to the final
coalescence is 2N generations.

The expected time to the coalescence of all of the k alleles (i.e., the most
recent time that the sample of k alleles shared a common ancestor) is

$$t = 4N\left(1 - \frac{1}{k}\right) \tag{7.40}$$

with variance

$$V = 16N^2 \sum_{i=2}^{k} \frac{1}{i^2 (i - 1)^2} \tag{7.41}$$



(Kingman  1982;  Tajima  1983).  As  the  sample  size  k increases  toward  the  total 
population  size,  t approaches  4N,  which  equals  the  expected  fixation  time  for 
a newly  arisen  mutation.  These  principles  allow  us  to  generate  simulated 
gene  genealogies  whose  branch  lengths  correspond  to  the  assumptions  of  the 
Wright-Fisher  model.  One  thing  the  model  still  lacks  is  mutation,  which  is 
introduced  in  the  next  discussion. 
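The expected interval lengths and their sum can be tabulated in a few lines. This is a minimal sketch assuming a diploid population of constant size N; the function names and the value N = 1000 are illustrative.

```python
def expected_intervals(k, n_diploid):
    """Expected generations spent with i lineages, for i = k, k-1, ..., 2."""
    return {i: 4.0 * n_diploid / (i * (i - 1)) for i in range(k, 1, -1)}


def expected_tmrca(k, n_diploid):
    """Expected time to the common ancestor of the whole sample (Eq. 7.40)."""
    return sum(expected_intervals(k, n_diploid).values())


n_diploid = 1000  # illustrative value
for i, gens in expected_intervals(5, n_diploid).items():
    print(f"{i} lineages: {gens:.1f} generations")
print("expected time to common ancestor:", expected_tmrca(5, n_diploid))
```

With five alleles the intervals are 2N/10, 2N/6, 2N/3, and 2N generations, and their sum equals 4N(1 - 1/5), which matches Equation 7.40.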


Coalescent  Models  with  Mutation 

In  order  to  generate  simulated  gene  sequence  data  representing  samples 
drawn  from  a population  obeying  the  infinite  sites  model,  Hudson  (1990, 
1993)  showed  that  one  can  proceed  as  follows: 

• Determine the sample size k and the θ for the gene region of interest;

• Draw random numbers with appropriate exponential distributions to
construct a gene genealogy such that times of coalescence follow
Equation 7.39;

• On each branch of this tree, distribute mutations with a Poisson
distribution, such that the mean number of mutations on each branch is
given by 2Nμt, where t is the branch length.

This  procedure  has  been  widely  used  in  generating  data  sets  under  the 
neutral  hypothesis  for  comparison  to  observed  data  sets. 
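The three steps above can be sketched as a short simulation. This is an illustrative sketch under stated assumptions, not Hudson's own program: time is measured in units of 2N generations (so the waiting time with i lineages is exponential with rate i(i - 1)/2 and each lineage accumulates mutations at rate θ/2), the pair that coalesces is chosen uniformly at random, and the function names and parameter values are hypothetical. Only the Python standard library is used, so Poisson draws are generated with Knuth's multiplication method.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by Knuth's multiplication method."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_sample(k, theta, rng):
    """Simulate one neutral sample under the infinite-sites model.

    Returns a list of segregating sites, each given as the frozenset of
    sample indices that carry the mutation at that site.
    """
    lineages = [frozenset([i]) for i in range(k)]
    sites = []
    while len(lineages) > 1:
        i = len(lineages)
        # Waiting time while there are i lineages, in units of 2N generations.
        t_i = rng.expovariate(i * (i - 1) / 2.0)
        for lineage in lineages:
            # Mutations on this branch segment: Poisson, mean (theta / 2) * t_i.
            sites.extend([lineage] * poisson(theta / 2.0 * t_i, rng))
        # Coalesce a randomly chosen pair of lineages.
        a, b = rng.sample(range(i), 2)
        merged = lineages[a] | lineages[b]
        lineages = [x for j, x in enumerate(lineages) if j not in (a, b)]
        lineages.append(merged)
    return sites


rng = random.Random(2024)  # fixed seed so the run is repeatable
counts = [len(simulate_sample(10, 5.0, rng)) for _ in range(1000)]
print("mean number of segregating sites:", sum(counts) / len(counts))
```

Averaged over many replicates, the number of simulated segregating sites should be close to θ times the sum of 1/i for i from 1 to k - 1, and the spread of the replicates gives a null distribution against which an observed S can be compared.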

From Figure 7.21, it follows that the sum of the branch lengths for the
entire gene tree is

$$T = \sum_{i=2}^{k} i\,T_i \tag{7.42}$$

where T_i is the length of the interval during which the sample has i
ancestral lineages. The expected number of segregating sites in the whole
sample is 2NμT, where T is the sum of the branch lengths, so substituting we
get

$$E(S) = 2N\mu T = \frac{\theta}{2} \sum_{i=2}^{k} i\,E(T_i) = \theta \sum_{i=1}^{k-1} \frac{1}{i} \tag{7.43}$$

The rightmost expression agrees with Equation 7.34, which we derived for
the infinite-sites model.
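The substitution behind Equation 7.43 can be written out step by step. The sketch below assumes that branch lengths are measured in units of 2N generations, so that the mean given with Equation 7.39 becomes E(T_i) = 2/[i(i - 1)]; that choice of units is inferred here from the factor 2Nμ = θ/2 and is not stated explicitly in the text.

```latex
\begin{align*}
E(S) &= \frac{\theta}{2}\sum_{i=2}^{k} i\,E(T_i)
      = \frac{\theta}{2}\sum_{i=2}^{k} i\cdot\frac{2}{i(i-1)} \\
     &= \theta\sum_{i=2}^{k}\frac{1}{i-1}
      = \theta\sum_{j=1}^{k-1}\frac{1}{j}.
\end{align*}
```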

The coalescent approach can be used to derive many fundamental principles
in population genetics. As one example, consider a population presently
in mutation-drift equilibrium. In the previous generation, a pair of alleles
can either coalesce, with probability 1/2N, or failing to coalesce, one or the
other allele may mutate with probability 2μ. (The factor 2 comes in because
either copy can mutate.) These are the only two events that affect identity,
and the sum of their probabilities is 1/2N + 2μ. The probability of identity
is therefore the fraction of the time that the alleles coalesce:


$$F = \frac{\dfrac{1}{2N}}{\dfrac{1}{2N} + 2\mu} = \frac{1}{1 + \theta} \tag{7.44}$$


We have already derived this equilibrium identity under the infinite-sites
(and infinite-alleles) models. Coalescence methods are not limited to the
consideration of the Wright-Fisher model. If one can develop a recursion
equation for probabilities of recombination, migration, or other such
phenomena in a gene tree context, then often powerful insights can be derived
from coalescence approaches. For our purposes, suffice it to say that the
method can generate classical results, often with much less difficulty, and
the coalescence approach is especially well suited to testing hypotheses about
samples drawn from populations.


PROBLEM 7.11 The probability distribution for the number of
generations back to the first coalescence (in a pure drift model) in a sample
of k genes taken from a haploid population of size N is approximately:

$$\Pr(\text{first coalescence } t \text{ generations ago}) = x e^{-xt}, \quad \text{where } x = \frac{\binom{k}{2}}{N}$$

From this one can show that the mean number of generations back to
the first coalescence is 1/x. The more genes in the sample, the more
likely it will be that a coalescence occurred recently. Calculate the
expected time to first coalescence in a population of N = 450 for a
sample of 10 genes. How many genes would you have to sample to
halve this coalescence time?

ANSWER The expected time to first coalescence in a population of
N = 450 for a sample of 10 genes is

$$\frac{1}{x} = \frac{N}{\binom{k}{2}} = \frac{450}{\binom{10}{2}} = \frac{450}{10 \times 9/2} = 10 \text{ generations.}$$

To determine how many genes one would have to sample to halve
this coalescence time, solve for k in

$$5 = \frac{450}{\binom{k}{2}}$$

This is equivalent to 90 = k!/[2!(k - 2)!]. By trial and error, you will
find that a sample of 14 genes will do it. Note that by increasing the
sample only from 10 to 14, we expect to find a pair of alleles half as
divergent from each other.
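The trial-and-error step in the answer is easy to automate. The following is a small sketch assuming the haploid formula 1/x = N divided by the number of pairs; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def expected_first_coalescence(n_haploid, k):
    """Mean generations back to the first coalescence among k genes."""
    return n_haploid / comb(k, 2)


def sample_size_for(n_haploid, target_generations):
    """Smallest sample size whose expected first coalescence is <= target."""
    k = 2
    while expected_first_coalescence(n_haploid, k) > target_generations:
        k += 1
    return k


print(expected_first_coalescence(450, 10))
print(sample_size_for(450, 5))
```

A sample of 13 genes gives 78 pairs, which is too few to reach 90, while 14 genes give 91 pairs, so the search stops at 14, in agreement with the answer.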


SUMMARY 

Gene frequencies fluctuate at random in finite populations. The rate at which
allele frequencies change varies inversely with population size. The reason
for the inverse relationship is that the sampling variance, when two alleles
are segregating in a population, is determined by the binomial sampling
process, and the binomial variance is pq/2N. The Wright-Fisher model
extended the idea of binomial sampling over multiple generations, and much
of our understanding of drift has been derived from this model. In a
population in which the only force acting on gene frequencies is random drift,
all variation must ultimately be lost. The Wright-Fisher model shows why the
probability that an allele will drift to fixation is equal to its initial
frequency in the population. The diffusion approximation of the Wright-Fisher
model is a second-order partial differential equation that yields the
distribution φ(x,t), giving the number of populations with allele frequency x
at time t. The diffusion approach has yielded important insights into the
consequences of drift, including the expected time to fixation and loss of
alleles. The expected time to fixation of a newly introduced allele is 4N
generations, showing once again that drift happens faster in smaller
populations.

A useful way to think about random drift is to consider a set of
subpopulations of the same size undergoing repeated generations of sampling
and drift. Within each of these subpopulations, genotypes are composed by
drawing alleles at random, so that each subpopulation is always in
Hardy-Weinberg equilibrium. The hypothetical population composed by pooling
the subpopulations will have a deficit of heterozygotes because, as allele
frequencies drift closer to fixation, the frequency of heterozygotes declines.
In a finite population, heterozygosity declines by a factor of (1 - 1/2N) each
generation, so that a population of size 10, say, loses 5% of its
heterozygosity each generation. The allele frequencies of the subpopulations
are equally likely to drift up as down, so the average allele frequency over
subpopulations shows no change.
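The decline in heterozygosity summarized above can be illustrated numerically. This is a minimal sketch assuming an idealized diploid population of constant size N with no mutation; the function name and values are illustrative.

```python
def heterozygosity(h0, n_diploid, t):
    """Expected heterozygosity after t generations of drift alone."""
    return h0 * (1.0 - 1.0 / (2 * n_diploid)) ** t


h0 = 0.5
for t in (0, 1, 10, 50):
    print(t, heterozygosity(h0, n_diploid=10, t=t))
```

With N = 10 the multiplier is 19/20 per generation, which is the 5% loss per generation mentioned in the text.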

Real biological populations do not precisely fit the Wright-Fisher model.
They generally exhibit changes in allele frequency that exceed the amount
expected based on the actual population size. The usual reason for the
discrepancy is that the drift process occurs as though there are fewer than
the observed census number of individuals. The models give better
correspondence to reality by calculating the effective population size.
Several different factors that require consideration in calculating effective
size were examined in this chapter, including unequal sex ratio, fluctuation
in population size over generations, and the uniparental transmission of mtDNA
and Y chromosomes.

Mutation introduces variation into populations, and random genetic drift
erodes that variation. These two forces come to a steady state predicted by
population genetic models. The infinite-alleles model assumes that each new
mutation generates a novel allele. The steady-state balance between mutation
and drift in the infinite-alleles model is given by the autozygosity, F, which
is also the probability that two alleles are identical by descent. Under the
infinite-alleles model for a diploid population, F = 1/(1 + θ), where
θ = 4Nμ. Note that the mutation rate and population size are confounded in
this model; increasing either one will decrease the autozygosity by the same
amount. We can write the same equation in terms of heterozygosity H = 1 - F,
giving H = θ/(1 + θ). The infinite-sites model is related to the
infinite-alleles model, but more specifically states that novel mutations
occur at a site along the gene that has not mutated before. (If this is true,
each new mutation must also generate a novel allele.) The infinite-sites model
generates predictions about the number of segregating sites expected in a
population at steady state. Here the result is that the expected number of
segregating sites in a sample of n alleles is

$$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

where the mutation rate in this value of θ is the mutation rate over the
entire gene in question.

Classical models of random genetic drift look forward in time, following
alleles as they are lost from the population and generated anew by mutation.
More recently, the coalescent approach has been to look backward in time,
starting with the observed sample of alleles, and calculating times to
common ancestry of alleles. Coalescent approaches are particularly appropriate
when one wants to consider the probability that a particular observed set of
molecular sequence data might have the characteristics expected of random
genetic drift. Computer generation of gene trees using principles of
coalescence theory makes it easy to produce a null distribution giving the
full range of outcomes expected under a drift model.

PROBLEMS

1.  Suppose  that  in  one  generation,  in  a population  of  size  50,  the  average 
heterozygosity  (averaged  across  loci)  is  reduced  from  0.50  to  0.42.  Is  the 
population  mating  at  random? 

2.  In  how  many  generations  will  the  expected  heterozygosity  be  5%  of  the 
initial  value  in  a diploid  randomly  mating  population  of  size  of  10?  Size 
100? 

3. A gene in one individual in a population of 24 barn cats undergoes
mutation to a new neutral allele. What is the probability that the allele
eventually becomes fixed? What is the probability that it eventually becomes
lost? What are the answers if the mutant gene is X-linked and the
population consists of equal numbers of males and females?

4. If an isolated population of annual alpine plants decreases in
heterozygosity by half every 50 years because of random genetic drift, what is
its effective population size?

5. Remote Pitcairn Island in the South Pacific was settled in 1789 by
Fletcher Christian and eight fellow mutineers from HMS Bounty, along with a
small number of Polynesian women. Although many descendants have
left the island in the intervening years, there has been essentially no
immigration. Assuming an effective size of 20 in each of the eight
generations since the island's settlement, what value of F_ST would be
expected in today's population from random genetic drift?

6. In a population of effective size N = 50, how long is required for random
genetic drift to double the value of the fixation index F from 0.01 to 0.02?
From 0.05 to 0.10? Assuming that F is small, how many generations are
required to double the value of F in a population of effective size N? For
the latter, use the approximations that [1 - (1/2N)]^t ≈ exp(-t/2N) and
that, when F is small (F < 0.10), ln(1 - F) ≈ -F.

7. What is the effective population number in a population of large
predatory cats in which each breeding male controls a harem of five females
and the total population consists of 200 males and 200 females?

8.  What  is  the  effective  population  size  of  a herd  of  ten  dairy  cows  and  one 
bull?  What  is  it  for  40  cows  and  one  bull?  For  10  cows  and  two  bulls? 

9.  What  is  the  variance  effective  population  size  for  an  X-linked  gene  in  a 
population  consisting  of  100  females  and  10  males?  In  a population  of  10 
females  and  100  males? 

10. Among 100 restriction site differences in two inbred strains of the flour
beetle Tribolium that are crossed and allowed thereafter to mate at random,
what number of restriction sites would be expected to remain segregating
after 10 generations assuming an effective population size of 80
individuals? How many would be expected to remain unfixed after 50
generations?

11. In a haploid population of constant effective size 50, what is the
probability that two randomly drawn alleles shared a common ancestor exactly
100 generations ago?

12.  Employing  the  infinite-sites  model,  if  0 = 10,  how  many  segregating  sites 
will  one  expect  to  find  in  a sample  of  size  10?  20?  50? 

13. Consider an isolated island population with no migration, effective
population size of 250,000, and a mutation rate of 10⁻⁶. Calculate the
expected heterozygosity under the infinite-alleles model. How much migration
is  necessary  to  increase  H to  %? 

14.  In  a haploid  population  of  effective  size  50,  how  large  a sample  must  one 
take  to  yield  an  expected  mean  coalescence  time  of  10  generations? 

15. Show that random genetic drift requires an average of t = 2N ln x
generations to reduce the heterozygosity from H0 to H0/x.

16. Use Equation 7.15 to show that approximately 2N generations of random
genetic drift are required to reduce the number of segregating genes by a
factor of e (e = 2.71828 . . .), given initial allele frequencies close to 0.5.

17. A set of six to eight oocytes from each of three women undergoing in
vitro fertilization (IVF) were recently tested for heteroplasmy (presence
of more than one mitochondrial DNA type within each cell). The mtDNA
from eggs of two women were all identical and matched that of somatic
cells with no heteroplasmy, but the other woman produced eggs with two
different mtDNA types. Densitometric scans allowed investigators to
determine that the individual cells had relative frequencies of the two
mtDNA types ranging from 20% to 50%. Assuming 30 cell generations
from zygote to zygote in the maternal germline, and N = 1000
mitochondria per cell, what do you conclude from these observations? Are they
consistent with neutral sampling of mtDNA types?


All the forces in population genetics have an impact on the
pattern of variation seen in molecular sequences of genes, including
mutation, migration, selection, and random drift. A primary
focus of molecular population genetics is to make inferences about the
contribution of each of these evolutionary forces to produce the patterns of
molecular sequence variation we see today. Usually this process involves a
close interplay between mathematical model building, statistical parameter
estimation, and experimental observation. Several times in the past,
unexpected patterns of sequence variation have arisen which, in turn, gave
rise to whole new avenues of theoretical inquiry. In many cases, inferences
about evolutionary forces transcend species boundaries by making use of data
on both within-species polymorphism and between-species divergence. The
genetic basis for species isolation is itself amenable to analysis. But first
let us begin with the basic theoretical principles that underlie molecular
population genetics.


THE  NEUTRAL  THEORY  AND  MOLECULAR  EVOLUTION 

The first systematic application of protein electrophoretic methods to
population genetics revealed extensive genetic variation within most natural
populations. Typically, 15 to 50% of the genes coding for enzymes were
observed to include two or more widespread, polymorphic alleles. The
polymorphic alleles occurred with frequencies considered to be too high to
result from equilibrium between adverse selection and mutation. Motoo Kimura
suggested that most polymorphisms observed at the molecular level are
selectively neutral, so that their frequency dynamics in a population are
determined by random genetic drift (Kimura 1968). By extension, the hypothesis
of selective neutrality would also apply to most nucleotide or amino acid
substitutions that occur within a molecule during the course of evolution.

The neutral theory has been of great importance in population genetics in
stimulating the collection and analysis of data in attempts to evaluate its
adequacy. Mathematical investigations of its implications have resulted in one
of the most complete and elegant theories in all of biology. Tests of the
correspondence of sample data to the neutral theory are almost universally low
in power, which means that large sets of data are needed before one has a
reasonable chance of rejecting neutrality. The recent trend has been that more
and more cases of departures from neutrality are being found, in part
because of the expansion in available data and in part because of the
increasing subtlety of tests that are applied. Regardless of the action of
other forces shaping molecular sequence variation in populations, the force of
random drift is always there, and for this reason the neutral theory remains
useful in generating rigorous null hypotheses. The next section summarizes
some of the theoretical implications of the neutral theory and some of the
data bearing on it.

Theoretical  Principles  of  the  Neutral  Theory 

The neutral theory models the fate of mutations that are so nearly
selectively neutral in their effects that their fate is determined largely
through random genetic drift. A variety of mutation models have been
considered, including infinite-alleles, infinite-sites, and finite-sites
models. In all models, though, random drift occurs when N adult individuals
produce an infinite pool of gametes from which 2N are chosen at random to
create the N zygotes of the next generation. Much of the complexity of the
mathematics of the neutral theory arises from the fact that the mutational
histories of alleles are not independent, because they share an overlapping
genealogical history. Before we get into the details of the predictions of the
neutral theory, let us first review some of the theory's principal
implications (Kimura 1983).

1. If a population contains a neutral allele with allele frequency p_0, then
the probability that the allele eventually becomes fixed equals p_0. In
particular, a newly arising neutral mutation occurs in just one copy, so the
initial allele frequency is p_0 = 1/2N, and the probability of eventual
fixation of the mutation is therefore 1/2N. Figure 8.1 shows that a mutant
allele arising in a smaller population has a higher chance of fixation.

2. The steady-state rate at which neutral mutations are fixed in a population
equals $\mu$, where $\mu$ is the neutral mutation rate. It is noteworthy that
the equilibrium rate of fixation does not involve the population size $N$. The
reason is that the $N$ cancels out: the overall rate is determined by the
product of the probability of fixation of new neutral mutations, $1/(2N)$, and
the average number of new neutral mutations in each generation, $2N\mu$; hence
$[1/(2N)] \times 2N\mu = \mu$.

Figure 8.1 Diagram showing the trajectory of neutral alleles in a population.
New alleles enter the population by mutation and have an initial allele
frequency of $1/(2N)$. Most alleles are lost, but those that go to fixation take
an average of $4N$ generations. The time between successive fixations of neutral
alleles is $1/\mu$ generations. (A) A moderate size population. (B) The same
population size; a higher mutation rate gives the same time to fixation, but
less time between fixations. (C) A smaller population has alleles that go to
fixation more rapidly, but the time between fixations is still $1/\mu$. (After
Kimura 1980.)

3. The average time that occurs between consecutive neutral substitutions equals
$1/\mu$. This principle follows directly from the one above. If the steady-state
rate of fixation is $\mu$ per unit time, the average length of time between
substitutions will be the reciprocal, or $1/\mu$. By way of analogy, if a Swiss
clock cuckoos at the rate of 24 times per day, then the average length of time
between cuckoos is 1/24th of a day, or one hour. As Figure 8.1 shows, the time
interval between fixations is independent of population size, and elevating the
mutation rate decreases the time interval between fixations.


PROBLEM 8.1 The neutral theory makes a strong prediction about the relationship
between population size and heterozygosity. Under the infinite-alleles model, we
can express the prediction by the formula $H = 4N\mu/(4N\mu + 1)$, and hence
small populations should have low heterozygosity and large populations high
heterozygosity. Do the data support this prediction? A survey of 77 species
reviewed by Nei and Graur (1984) found that species with very small populations
(less than, say, $10^4$) had a mean protein heterozygosity of 0.05, whereas
those species with a very large population (greater than $10^9$, say, which
include Drosophila species) have heterozygosities of around 0.2. This positive
correlation seems to favor the neutral theory, except that the range of $H$ is
much smaller than theory predicts in view of the enormous range in $N$. When
these extremes of population size are excluded ($N < 10^4$ and $N > 10^9$),
there is no significant correlation between population size and heterozygosity.
What is going on?


ANSWER The paradoxical result demonstrates that levels of variability in a
population are determined by several forces, and that different organisms may be
affected by the forces to different magnitudes. The result does not support the
neutral theory insofar as it shows that population size does not, by itself,
explain levels of variation. On the other hand, the discrepancy is not grounds
to completely toss out the neutral theory. For one thing, the population sizes
were generally roughly estimated, and effective sizes (Chapter 7), which were
not estimated, are more relevant to neutral predictions of heterozygosity. There
is also an implicit assumption that mutation rates are identical in all
organisms, and violations of this assumption can be found.
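
The size of the discrepancy discussed in Problem 8.1 can be made concrete with a
short calculation. The sketch below evaluates the infinite-alleles prediction
$H = 4N\mu/(4N\mu + 1)$ at the two population-size thresholds mentioned in the
problem. The mutation rate `MU` is an illustrative assumption, not a value given
in the text, and $N$ is treated as if it were the effective size.

```python
def expected_heterozygosity(n, mu):
    """Equilibrium heterozygosity under the infinite-alleles model."""
    theta = 4.0 * n * mu
    return theta / (theta + 1.0)


# Assumed neutral mutation rate per locus per generation (illustrative only).
MU = 1e-7

for n in (1e4, 1e9):
    h = expected_heterozygosity(n, MU)
    print(f"N = {n:.0e}: predicted H = {h:.3f}")
```

Under this assumed mutation rate the predicted heterozygosity runs from about
0.004 to nearly 1, a far wider span than the observed range of roughly 0.05 to
0.2 quoted in the problem.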


4. Analysis of the diffusion equation has shown that, among newly arising
neutral alleles that are destined to be fixed, the average time to fixation is
$4N_e$ generations (where $N_e$ is the effective population size). This too is
evident in Figure 8.1: alleles that go to fixation do so in less time in the
smaller population. Among newly arising neutral alleles destined to be lost, the
average time to loss is $(2N_e/N)\ln(2N)$ generations. The average times
required for fixation or loss apply to newly arising alleles, which are
necessarily present in just one copy, so $p_0 = 1/(2N)$. The implication of
these formulas is that, on average, neutral mutations that are going to be fixed
require a very long time for this to occur, but mutations destined to be lost
are lost quite rapidly.

5. If each neutral mutation creates an allele that is different from all others
existing in the population in which it occurs, then, at equilibrium, when the
average number of new alleles gained through mutation is exactly offset by the
average number lost through random genetic drift, the expected homozygosity
equals $1/(4N_e\mu + 1)$, where $\mu$ is the neutral mutation rate. The model of
mutation in which each new allele is novel is the infinite-alleles model of
mutation. The quantity $4N_e\mu$, which shows up frequently in the neutral
theory, is often denoted as $\theta$. The equilibrium average homozygosity is
therefore $1/(1 + \theta)$. Since the heterozygosity equals one minus the
homozygosity, the average heterozygosity at equilibrium in the infinite-alleles
model equals $\theta/(1 + \theta)$. Larger populations are expected to have a
higher heterozygosity, as reflected in the greater number of alleles segregating
at any one time in the larger populations in Figure 8.2.

Figure 8.2 Given the enormous variation in effective population sizes, one would
expect to see a wider range in variation in heterozygosity than is actually
observed. The relation between population size and heterozygosity does not fit
the neutral theory expectation over a wide range of intermediate population
sizes. (After Nei and Graur 1984.)

ESTIMATING RATES OF MOLECULAR SEQUENCE DIVERGENCE
Rates  of  Amino  Acid  Replacement 

The initial impetus for the neutral theory came from observations on the rate of
amino acid replacements in proteins. When extrapolated to the entire genome, the
inferred rate of evolution was several nucleotide substitutions per year. This
rate was regarded as much too high to result from natural selection, because the
intensity of selection must be limited by the total amount of differential
survival and reproduction that occurs in the organism. Direct DNA sequencing
later revealed that rates of nucleotide substitution vary according to the
function (or presumed absence of function) of the nucleotides. The type of data
that must be analyzed is best illustrated by example. The first 18 amino acids
present at the amino terminal end of the human and mouse γ-interferon proteins
constitute a signal peptide that is used in secretion of the molecules (Gray and
Goeddel 1983). The sequences are:


Human:  Met Lys Tyr Thr Ser Tyr Ile Leu Ala Phe Gln Leu Cys Ile Val Leu Gly Ser

Mouse:  Met Asn Ala Thr His Cys Ile Leu Ala Leu Gln Leu Phe Leu Met Ala Val Ser


In  order  to  calculate  the  proportion  of  amino  acids  that  differ  in  the  two 
signal  sequences,  we  can  simply  count  the  number  of  sites  that  are  the  same 
and  the  number  of  sites  that  differ.  Among  the  18  amino  acids  there  are  10 
differences,  so  the  proportion  different  is  10/18  = 0.56. 
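
The count described above is easy to reproduce. The following minimal sketch
compares the two signal peptides position by position; the residues are written
with standard three-letter codes, and the function name is a hypothetical helper
introduced only for this illustration.

```python
HUMAN = ("Met Lys Tyr Thr Ser Tyr Ile Leu Ala Phe Gln Leu "
         "Cys Ile Val Leu Gly Ser").split()
MOUSE = ("Met Asn Ala Thr His Cys Ile Leu Ala Leu Gln Leu "
         "Phe Leu Met Ala Val Ser").split()


def proportion_different(seq_a, seq_b):
    """Return (number of differing sites, number of sites, proportion D)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    n_diff = sum(a != b for a, b in zip(seq_a, seq_b))
    return n_diff, len(seq_a), n_diff / len(seq_a)


n_diff, n_sites, d = proportion_different(HUMAN, MOUSE)
print(f"{n_diff} differences among {n_sites} sites, D = {d:.2f}")
```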

To interpret these data, let us suppose that amino acid replacements occur at
the rate $\lambda$ per unit time. Consider two independently evolving sequences,
initially identical, which at time $t$ are found to differ in the proportion
$D_t$ of their amino acids. After the next time interval, the proportion of
differences $D_{t+1}$ is given by

$$D_{t+1} = (1 - D_t)(2\lambda) + D_t \tag{8.1}$$

In this equation, $(1 - D_t)(2\lambda)$ is the proportion of sites, previously
identical, in which one or the other underwent an amino acid replacement during
the time interval in question, which must be added to the already existing
differences $D_t$ in order to give the total. (The equation ignores the unlikely
possibility of an amino acid replacement making two previously different amino
acid sites identical.) The factor of 2 is present because the total time for
evolution is $2t$ units ($t$ units in each lineage after the split), which is
illustrated in Figure 8.3. Equation 8.1 suggests the differential equation

$$dD/dt = D_{t+1} - D_t = 2\lambda - 2\lambda D_t \tag{8.2}$$

which has the solution

$$D_t = 1 - e^{-2\lambda t} \tag{8.3}$$

Figure 8.3 Two amino acid or nucleotide sequences that have each undergone
independent evolution from a common ancestor for $t$ time units are separated by
a total time of $2t$ units because there are $t$ units in each lineage after the
split. The proportion of sites that differ in the sequence is denoted $D$ and
the total number of sites $L$. In this particular example, $L = 10$ and
$D = 3/10$.


An alternative argument can be used to derive Equation 8.3 without resorting to
differential equations. If $\lambda$ is the rate of amino acid replacement per
unit time, then the probability that a particular site remains unsubstituted for
$t$ consecutive intervals along each of two independent lineages is
$(1 - \lambda)^{2t}$, which is approximately equal to $e^{-2\lambda t}$,
provided that $\lambda t$ is not too large. Thus, the probability $D_t$ of one
or more replacements occurring in $t$ units of time after divergence is
approximately $1 - e^{-2\lambda t}$, which is Equation 8.3.
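
As a numerical check on the two derivations, the sketch below iterates the
recursion of Equation 8.1 step by step and compares the result with the closed
form of Equation 8.3. The per-step rate `LAM` is an arbitrary assumed value
chosen for illustration.

```python
import math


def iterate_divergence(lam, steps):
    """Apply D(t+1) = (1 - D(t)) * 2*lam + D(t), starting from D(0) = 0."""
    d = 0.0
    for _ in range(steps):
        d = (1.0 - d) * (2.0 * lam) + d
    return d


def closed_form_divergence(lam, t):
    """Equation 8.3: D(t) = 1 - exp(-2 * lam * t)."""
    return 1.0 - math.exp(-2.0 * lam * t)


LAM = 0.001  # assumed replacement rate per site per time step

for t in (10, 100, 1000):
    recursion = iterate_divergence(LAM, t)
    formula = closed_form_divergence(LAM, t)
    print(f"t = {t:4d}: recursion = {recursion:.4f}, closed form = {formula:.4f}")
```

The two columns agree closely when the per-step rate is small, which is the
regime in which replacing the difference equation by a differential equation is
reasonable.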

Since $\lambda$ is the rate of amino acid replacement per unit time, the
expected proportion of differences between two sequences at any time $t$ is

$$K = 2\lambda t \tag{8.4}$$

where the factor of 2 is again present because the total time for evolution is
$2t$ units (Figure 8.3).

Substituting $K$ from (8.4) into (8.3) and rearranging yields the following
estimate $\hat{K}$ of $K$:

$$\hat{K} = -\ln(1 - D) \tag{8.5}$$

where $D$ is the observed proportion of sites in which two sequences differ. If
the sequences under comparison are $L$ amino acids in length, then the variance
$\mathrm{Var}(\hat{K})$ of $\hat{K}$ is estimated from the distribution of
$\hat{K}$ implied by the substitution process and is approximately

$$\mathrm{Var}(\hat{K}) = D/[(1 - D)L] \tag{8.6}$$

The rate of evolution at the molecular level is given by the amount of sequence
divergence that occurs per unit of time. Thus, as suggested by Equations 8.4 and
8.5, if two sequences are compared, and these are known to have diverged from a
common ancestral sequence an estimated $t$ time units ago, then the rate of
evolution $\lambda$ may be estimated as

$$\hat{\lambda} = \hat{K}/(2t) \tag{8.7}$$

The units of $\lambda$ are usually expressed as replacements per amino acid site
(or substitutions per nucleotide site) per year.

The quantity $K$ is used in preference to $D$ in estimating the rate of
molecular evolution because $K$ takes multiple substitutions into account. Over
long periods of evolutionary time, the amino acid present at a particular site
may be replaced several times, first by one alternative, then by another, then
still another, and perhaps, at some stage, even return to the amino acid
originally present at the site. When comparing two sequences, only the sites
that are different can be identified. Sites that are identical at the present
time may include some that were different in the past, and sites that are
different at the present time might have undergone more than one substitution.
The quantity $D$ is determined only by the proportion of differences between the
sequences observed at the present time. The estimate $K$ makes a correction for
multiple substitutions, but at the cost of introducing assumptions that the
substitutions occur independently and at the same rate through time.

For relatively short intervals of evolutionary time, during which multiple
substitutions remain uncommon, the correction is minor, and the value of $K$ is
close to that of $D$. This can be seen by the fact that the initial slope of the
curve plotted in Figure 8.4 is 1. As the observed sequence divergence increases,
it becomes more likely that multiple hits have happened, so the slope decreases.
Over longer intervals, when many multiple substitutions have occurred, the
correction is important, and the assumptions on which it is based must be
evaluated critically. Correction for multiple substitution events is even more
important for nucleotides than it is for amino acids. With amino acids, the
probability of a random replacement returning an amino acid site to its original
identity is 1/20 (assuming equal frequencies), whereas for nucleotides it is
1/4.


Figure 8.4 As sequences become more divergent over time, the number of
substitutions per site ($K$) can continue to increase, but the proportion of
sites that mismatch in the observed sequences ($D$) saturates.


PROBLEM 8.2 Use the data in the preceding example to estimate the average rate
of amino acid replacement in the signal peptide of γ-interferon during the
divergence of mice and humans. Based on fossil evidence, the separation of these
species occurred approximately 80 million years ago.


ANSWER For the signal peptide, $D = 0.56$ and $K = -\ln(1 - 0.56) = 0.82$. The
estimated rate of evolution is therefore $0.82/[2 \times (80 \times 10^6)] =
5.1 \times 10^{-9}$ amino acid replacements per amino acid site per year. The
standard deviation of $K$ is estimated as equal to
$[0.56/(0.44 \times 18)]^{1/2} = 0.27$. With such a small sample size, the
estimates could ordinarily not be taken too literally. However, in this case,
the average rate for the signal sequence is very close to the average rate for
the molecule as a whole. For γ-interferon, among 155 amino acid sites there are
91 differences, giving $K = 0.88 \pm 0.22$ and an average rate of
$5.5 \times 10^{-9}$ amino acid replacements per amino acid site per year.
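
The signal-peptide arithmetic in the answer can be retraced with Equations 8.5
through 8.7. This is a minimal sketch: the function names are hypothetical, and
the rounded value $D = 0.56$ is used, as in the answer, rather than the exact
fraction 10/18.

```python
import math


def corrected_divergence(d):
    """Equation 8.5: K = -ln(1 - D)."""
    return -math.log(1.0 - d)


def sd_of_divergence(d, n_sites):
    """Square root of Equation 8.6: Var(K) = D / [(1 - D) L]."""
    return math.sqrt(d / ((1.0 - d) * n_sites))


def rate_per_year(k, divergence_years):
    """Equation 8.7: rate = K / (2t)."""
    return k / (2.0 * divergence_years)


D_SIGNAL = 0.56          # rounded proportion of differing sites (10 of 18)
N_SITES = 18             # length of the signal peptide
DIVERGENCE_YEARS = 80e6  # mouse-human separation used in the problem

k_signal = corrected_divergence(D_SIGNAL)
sd_signal = sd_of_divergence(D_SIGNAL, N_SITES)
rate_signal = rate_per_year(k_signal, DIVERGENCE_YEARS)

print(f"K = {k_signal:.2f}, SD = {sd_signal:.2f}, rate = {rate_signal:.2e} per site per year")
```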


Rates of amino acid replacement vary over a 500-fold range in different
proteins. The rate of amino acid replacement in γ-interferon is one of the
fastest rates known (Li et al. 1985). Among the slowest rates is that of histone
H4, for which $\lambda = 0.01 \times 10^{-9}$ per year. The average rate among a
large number of proteins is very close to the rate found in hemoglobin, which is
approximately $1 \times 10^{-9}$ amino acid replacements per amino acid site per
year.

To be concrete about the interpretation of the rate of amino acid replacement,
consider a protein exactly 100 amino acids in length, in which the rate of amino
acid replacement per amino acid site equals $1.0 \times 10^{-9}$ per year. For
the entire protein, the rate of replacement equals
$100 \times 1.0 \times 10^{-9} = 1 \times 10^{-7}$ per year. In two different
species, therefore, the protein would accumulate amino acid differences at the
rate of one replacement every 5 million years since their divergence from a
common ancestor [because $(5 \times 10^6) \times 2 \times (1 \times 10^{-7}) =
1.0$].

The simple model that we just examined makes an assumption that is violated by
an abundance of data. We assumed that all amino acid replacements occur with
equal likelihood. Besides the fact that real proteins violate this assumption,
we might not have expected it to be true, since some amino acid changes require
a single underlying nucleotide change, while others require two or even three
changes. More sophisticated models for amino acid sequence evolution account for
these differences by weighting amino acid changes with their observed rates of
change (Dayhoff 1972; Jones et al. 1992).

Rates  of  Nucleotide  Substitution 

Nucleotide sequences are analyzed in the same manner as amino acid sequences,
but the analogous equation to (8.1) is slightly more complicated because it has
to correct for cases in which a substitution makes two previously different
nucleotide sites identical. The correction is significant for nucleotide
sequences because an expected one third of random substitutions will make two
previously different nucleotides identical. The correction is usually
unnecessary for proteins because only 1/19 of random replacements make two
previously different amino acids identical.

Several models of nucleotide substitution have been studied, which differ
primarily in the assumptions about rates of mutation between pairs of
nucleotides. The simplest model is one in which mutation occurs at a constant
rate, and each nucleotide is equally likely to mutate to any other (Jukes and
Cantor 1969). If $\alpha$ is the rate of mutating from one nucleotide to a
different nucleotide, then in any time interval, A mutates to C with probability
$\alpha$, A mutates to T with probability $\alpha$, and A mutates to G with
probability $\alpha$. The probability that A does not mutate in this interval is
therefore $1 - 3\alpha$. The probability that a particular site is A at time
$t + 1$ is

$$P_{A(t+1)} = (1 - 3\alpha)P_{A(t)} + \alpha(1 - P_{A(t)}) \tag{8.8}$$

because the first part of the equation gives the probability of having been A at
time $t$ and not mutating, and the second part is the probability of being any
other nucleotide and mutating to A. From 8.8 it follows that

$$P_{A(t+1)} - P_{A(t)} = dP_{A(t)}/dt = -4\alpha P_{A(t)} + \alpha \tag{8.9}$$

Solving this differential equation,

$$P_{A(t)} = \tfrac{1}{4} + \tfrac{3}{4}e^{-4\alpha t} \tag{8.10}$$

assuming that the initial state was A. This is the transition probability from A
to A, which we can write as $P_{AA}$. If we observe two sequences that have been
separated for time $t$, then the probability that they continue to carry the
same nucleotide at a particular site is

$$P_{AA} = \tfrac{1}{4} + \tfrac{3}{4}e^{-8\alpha t} \tag{8.11}$$

because $2t$ is the total duration of time along both lineages during which
changes could occur. Let $d$ be the proportion of nucleotide sites that differ
between two sequences:

$$d = 1 - P_{AA} \tag{8.12}$$

so

$$d = \tfrac{3}{4}(1 - e^{-8\alpha t}) \tag{8.13}$$

In the previous symbols, $\lambda$ is the rate of mutation to a nucleotide
different from the current nucleotide, so relating this to $\alpha$, we have
$\lambda = 3\alpha$. This implies that $k = 2\lambda t = 2(3\alpha t) =
6\alpha t$. Taking logarithms of both sides of Equation 8.13, we deduce

$$8\alpha t = -\ln(1 - 4d/3) \tag{8.14}$$

and, since $k = \tfrac{3}{4}(8\alpha t)$,

$$k = -\tfrac{3}{4}\ln(1 - 4d/3) \tag{8.15}$$

where $k$ is the expected proportion of nucleotide sites that differ between two
sequences at a time $t$ units after their evolutionary separation. By analogy
with protein evolution, $d$ is the observed proportion of $L$ nucleotide sites
in which the sequences differ. The variance $\mathrm{Var}(k)$ of the estimate
can be estimated as

$$\mathrm{Var}(k) = d(1 - d)/[L(1 - 4d/3)^2] \tag{8.16}$$

Figure 8.5 shows the relationship between time and $d$, and shows that
nucleotide sequences that follow the Jukes-Cantor pattern of mutation (all
nucleotides equally interchangeable) approach an asymptote, showing a divergence
of 3/4. This makes intuitive sense because, after sufficient time, the common
ancestry of the sequences has been erased, and 1/4 of the sites will match by
chance.
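
The saturation of $d$ and the way Equation 8.15 undoes it can be seen in a short
round-trip calculation. The sketch below generates the expected observed
divergence from Equation 8.13 and then recovers $k = 6\alpha t$ with Equation
8.15. The value of `ALPHA` and the divergence times are assumptions chosen only
to span the unsaturated and saturated regimes.

```python
import math


def jc_observed_divergence(alpha, t):
    """Equation 8.13: d = (3/4) * (1 - exp(-8 * alpha * t))."""
    return 0.75 * (1.0 - math.exp(-8.0 * alpha * t))


def jc_distance(d):
    """Equation 8.15: k = -(3/4) * ln(1 - 4d/3); undefined for d >= 3/4."""
    if d >= 0.75:
        raise ValueError("observed divergence must be below 0.75")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


ALPHA = 1e-9  # assumed rate per nucleotide pair per year (illustrative)

for years in (1e7, 1e8, 1e9):
    d = jc_observed_divergence(ALPHA, years)
    print(f"t = {years:.0e}: d = {d:.4f}, k = {jc_distance(d):.4f}, "
          f"6*alpha*t = {6 * ALPHA * years:.4f}")
```

At the longest time the observed divergence is already very close to 0.75 while
$k$ keeps growing, so small sampling errors in $d$ translate into large errors
in $k$; this is the behavior that Equation 8.16 quantifies.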


Figure 8.5 Simulations of the substitution process for nucleotide sequences show
that the sequence divergence saturates at $d = 0.75$. The jagged lines are
numerical simulations of a sequence of length 1000, and the dots give the
prediction under the Jukes-Cantor model.


PROBLEM 8.3 The coding region of the trpA genes in strains of the related
enteric bacteria Escherichia coli strain K12 and Salmonella typhimurium strain
LT-2 were sequenced and compared (Nichols and Yanofsky 1979). The trpA gene
codes for one of the subunits of the enzyme tryptophan synthetase used in the
synthesis of tryptophan. Estimate the amount of nucleotide divergence $k$ and
amino acid divergence $K$ and their standard deviations.

K12:  GTC GCA CCT ATC TTC ATC TGC CCG CCA AAT GCC GAT GAC GAC CTG CTG CGC CAG ATA GCC
      Val Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Ile Ala

LT2:  ATC GCG CCC ATC TTC ATC TGC CCG CCA AAT GCC GAT GAC GAT CTT CTC CGC CAG GTC GCA
      Ile Ala Pro Ile Phe Ile Cys Pro Pro Asn Ala Asp Asp Asp Leu Leu Arg Gln Val Ala


ANSWER For the amino acid sequences, $L = 20$ and $D = 2/20 = 0.10$; thus
$K = -\ln(0.90) = 0.105$ with standard deviation 0.074. For the nucleotide
sequences, $L = 60$ and $d = 9/60 = 0.15$; thus $k = -\tfrac{3}{4}\ln(0.8) =
0.167$ with standard deviation 0.058. Assuming that Escherichia and Salmonella
diverged at around the time of the mammalian radiation 80 million years ago, the
rates of evolution are $0.167/(2 \times 80 \times 10^6) = 1.04 \times 10^{-9}$
nucleotide substitutions per year and $0.105/(2 \times 80 \times 10^6) =
0.66 \times 10^{-9}$ amino acid replacements per year. In the gene as a whole,
the values are $k = 0.300$ for nucleotide substitutions and $K = 0.162$ for
amino acid replacements.

The Jukes-Cantor model assumes that all possible nucleotide changes occur at an
equal rate. In fact, it is generally observed from sequence comparisons that
transitions, or changes either from purine to purine (G<=>A) or from pyrimidine
to pyrimidine (C<=>T), are more frequent than transversions (the other possible
changes). Kimura (1980a) sought to accommodate this observation by making a
model with two mutation-rate parameters. Transitions occur with rate $\alpha$
and transversions occur with rate $\beta$. The rate matrix below shows the
parameters of the Kimura two-parameter model; as you might guess, other models
can also be specified by adding parameters to this table. These models can be
fitted to the data in a variety of ways, including solutions of the sort we
derived for the Jukes-Cantor model, as well as with more complex numerical
methods.
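
The trpA comparison from Problem 8.3 offers a convenient way to see both the
Jukes-Cantor calculation and the transition/transversion distinction in action.
The sketch below counts differing sites, classifies each as a transition or a
transversion, and applies Equations 8.15 and 8.16. The K12 sequence is entered
with G at the first position of codons 1 and 11, as implied by the Val and Ala
translations printed beneath it; treat that as an assumption if your copy of
the sequence differs.

```python
import math

K12 = ("GTC GCA CCT ATC TTC ATC TGC CCG CCA AAT "
       "GCC GAT GAC GAC CTG CTG CGC CAG ATA GCC").replace(" ", "")
LT2 = ("ATC GCG CCC ATC TTC ATC TGC CCG CCA AAT "
       "GCC GAT GAC GAT CTT CTC CGC CAG GTC GCA").replace(" ", "")

PURINES = {"A", "G"}


def is_transition(base_a, base_b):
    """True when both bases are purines or both are pyrimidines."""
    return (base_a in PURINES) == (base_b in PURINES)


transitions = 0
transversions = 0
for a, b in zip(K12, LT2):
    if a == b:
        continue
    if is_transition(a, b):
        transitions += 1
    else:
        transversions += 1

n_sites = len(K12)
d = (transitions + transversions) / n_sites
k = -0.75 * math.log(1.0 - 4.0 * d / 3.0)                            # Equation 8.15
sd_k = math.sqrt(d * (1.0 - d) / (n_sites * (1.0 - 4.0 * d / 3.0) ** 2))  # Equation 8.16

print(f"transitions = {transitions}, transversions = {transversions}")
print(f"d = {d:.3f}, k = {k:.3f}, SD = {sd_k:.3f}")
```

Although there are twice as many possible transversions as transitions, the two
classes are about equally represented among these nine differences, which is the
kind of excess of transitions that motivates a two-parameter model.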

Rate matrix for the Kimura two-parameter model (rows give the current
nucleotide, columns the nucleotide after one time interval; α is the transition
rate and β the transversion rate):

           A             G             C             T
A     1 - α - 2β         α             β             β
G          α        1 - α - 2β         β             β
C          β             β        1 - α - 2β         α
T          β             β             α        1 - α - 2β

Usually we have data on more than two sequences, and estimates of the parameters
of the substitution models are sought. If the phylogeny of the organisms in the
data set is known, it is possible to calculate the likelihood of the observed
sequences given the phylogeny and the parameters in the model (Felsenstein
1981). Many advances have been made in recent years in applying the method of
maximum likelihood for estimating parameters of the substitution process in this
context (Goldman 1993; Yang 1996a).


Other  Measures  of  Molecular  Divergence 

Rates of evolution and divergence times can be estimated from other kinds of
molecular data, provided that one considers carefully how the process of
mutation produces the differences that are actually scored. For example,
Randomly Amplified Polymorphic DNA (RAPD), which is analyzed by the polymerase
chain reaction (PCR), can be used to estimate nucleotide divergence only if one
can verify some questionable assumptions about how PCR reactions work (Clark and
Lanigan 1993). VNTR loci (loci that are polymorphic due to variable numbers of
tandem repeats generated by unequal exchanges) have a forward and back mutation
pattern that results in very different population dynamics. In a similar manner,
microsatellites, also known as STRPs (short tandem repeat polymorphisms),
undergo increases and decreases in copy number such that small changes in copy
number are more common than large changes. The result is a sort of stepwise
mutation process, and models of this have yielded predictions about patterns of
microsatellite variability that are roughly concordant with observations
(Zhivotovsky and Feldman 1995).


THE  MOLECULAR  CLOCK 

Although the rate of nucleotide substitution and amino acid replacement varies
among different genes, the average rate of molecular evolution can be rather
uniform throughout long periods of evolutionary time. Such uniformity in the
rate of amino acid replacement or nucleotide substitution, first noted by
Zuckerkandl and Pauling (1962), is known as a molecular clock.

An example of the approximate uniformity in amino acid substitutions is
illustrated in the evolution of the α-globin gene in the organisms depicted in
the phylogenetic tree in Figure 8.6. The data are summarized in Table 8.1. The
numbers above the diagonal are the percent amino acid differences
($D \times 100$) between the α-globin sequences. For example, the α-globin genes
of dog and human differ in 16.3% of their amino acid sites; since mammalian
α-globin contains 141 amino acids, this percentage corresponds to 23 sites in
which the amino acids differ. The percentages exclude differences that result
from the insertion or deletion of amino acids, which are called gaps in sequence
comparisons. For example, the comparison between human and shark α-globin is
based on 139 amino acid sites that are homologous, and excludes gaps amounting
to 11 additional amino acid sites. Missing from Figure 8.6 are plants, which
(remarkably) also have sequences, known as leghemoglobin, that show significant
homology to vertebrate globins (Landsmann et al. 1986).

Beneath the diagonal in Table 8.1 are the estimated proportions of differences
per amino acid site, calculated from Equation 8.5 as $K = -\ln(1 - D)$. The
table also gives the average value of $K$ in all comparisons with the shark,
carp, newt, chicken, echidna, kangaroo, and dog, respectively, and the
divergence times from the bifurcations in Figure 8.6.

The average proportion of differences per site is plotted against divergence
time in Figure 8.7. The very close fit to a straight line is evident. Since the
divergence time is exactly half of the total time available for evolution
(Figure 8.6), the rate of evolution $\lambda$ can be estimated as one-half times
the slope of the line in Figure 8.7. For these data, the slope is
$1.8 \times 10^{-9}$, and therefore $\lambda = 0.9 \times 10^{-9}$ amino acid
replacements per amino acid site per year. The good fit of the points to the
straight line indicates that the actual rate of α-globin evolution has deviated
little from the average for the past 450 million
years.

TABLE 8.1  RATE OF EVOLUTION IN THE α-GLOBIN GENE

           Shark   Carp   Newt   Chicken   Echidna   Kang    Dog   Human
Shark        -     59.4   61.4    59.7      60.4     55.4   56.8   53.2
Carp       0.90     -     53.2    51.4      53.6     50.7   47.9   48.6
Newt       0.95   0.76     -      44.7      50.4     47.5   46.1   44.0
Chicken    0.91   0.72   0.59      -        34.0     29.1   31.2   24.8
Echidna    0.93   0.77   0.70    0.42        -       34.8   29.8   26.2
Kang       0.81   0.71   0.64    0.34      0.43       -     23.4   19.1
Dog        0.84   0.65   0.62    0.37      0.35     0.27     -     16.3
Human      0.76   0.67   0.58    0.28      0.30     0.21   0.18     -

Avg K      0.87   0.71   0.63    0.35      0.36     0.24   0.18
Time       450    410    360     290       225      135     80

(Percentage data from Kimura 1983.)

Note: Values above the diagonal are the observed percent amino acid differences
($D$) between the α-globin sequences in the species; values below the diagonal
are the expected amino acid differences per site [$K = -\ln(1 - D)$]. Average
values of $K$ and the estimated times of divergence (in millions of years) are
given at the bottom of the table. Abbreviation: Kang, kangaroo.
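
The lower half of Table 8.1 can be regenerated from the upper half. The sketch
below converts each percent difference to $K = -\ln(1 - D)$ and then averages,
for each species, its comparisons with all species listed after it, which is how
the "Avg K" row appears to have been formed. The data layout and variable names
are choices made for this illustration.

```python
import math

SPECIES = ["Shark", "Carp", "Newt", "Chicken", "Echidna", "Kangaroo", "Dog", "Human"]

# Percent differences above the diagonal of Table 8.1.
# Row i holds the comparisons of SPECIES[i] with SPECIES[i+1], ..., "Human".
PERCENT_DIFFERENT = [
    [59.4, 61.4, 59.7, 60.4, 55.4, 56.8, 53.2],
    [53.2, 51.4, 53.6, 50.7, 47.9, 48.6],
    [44.7, 50.4, 47.5, 46.1, 44.0],
    [34.0, 29.1, 31.2, 24.8],
    [34.8, 29.8, 26.2],
    [23.4, 19.1],
    [16.3],
]


def k_from_percent(percent):
    """Equation 8.5 applied to a percentage: K = -ln(1 - D)."""
    return -math.log(1.0 - percent / 100.0)


average_k = {}
for name, row in zip(SPECIES, PERCENT_DIFFERENT):
    k_values = [k_from_percent(p) for p in row]
    average_k[name] = sum(k_values) / len(k_values)

for name, value in average_k.items():
    print(f"{name:9s} average K = {value:.2f}")
```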


Figure 8.7 Estimated number of amino acid substitutions in α-globin ($K$)
between pairs of the vertebrate species in Figure 8.6, plotted against the time
since each pair diverged from a common ancestor. The straight line is expected
based on a uniform rate of amino acid substitution during the entire period.
(From Kimura 1983.)


PROBLEM 8.4 The β-globin molecule in primates contains 146 amino acids, and
estimates of the number of amino acid differences among various primates are
tabulated below (data from Kimura 1983). Calculate the average rate of evolution
of the β-globin molecule in primates. (Hint: First calculate $D$ and $K$ for
each species pair, then plot the points with time on the x axis and $K$ on the y
axis. Finally, do a linear regression to estimate the average rate of
substitution.)


Time of divergence       Average number of
(millions of years)      amino acid differences

        85                      25.5
        60                      24.0
        42                       6.25
        40                       6.0
        30                       2.5
        15                       1.0

ANSWER D values are obtained by dividing each number of amino acid differences
by 146, and average values of K are estimated as -ln(1 - D). The average K
values, from top to bottom, are 0.192, 0.180, 0.044, 0.042, 0.018, and 0.007,
respectively. These are the y values in the linear regression, and the x values
are the divergence times. Altogether there are n = 6 points. In this case,
Σ(xy) = 3.1263 × 10⁷, Σ(x) = 2.72 × 10⁸, Σ(y) = 0.482, and Σ(x²) = 1.5314 × 10¹⁶.
The slope of the regression is 3.15 × 10⁻⁹, and the rate of evolution is half of
this (because differences accumulate along both diverging lineages), or
1.58 × 10⁻⁹ amino acid replacements per amino acid site per year. This estimate
is reasonably close to the value of 0.9 × 10⁻⁹ per year calculated for α-globin.
(Note: Rather than calculate K from the average number of amino acid
differences, it would be more accurate to calculate K for each species
comparison and then take the average; however, in this example, it makes very
little difference.)
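
The regression in this answer can be checked numerically. The following is a minimal Python sketch, assuming an ordinary least-squares fit with an intercept (which is what the quoted sums imply). The function and variable names are illustrative and do not come from the text. K is computed here from unrounded D values, so the intermediate sums differ slightly from the rounded ones quoted above.

```python
import math

SITES = 146  # amino acids in beta-globin
# (divergence time in millions of years, average number of amino acid differences)
DATA = [(85, 25.5), (60, 24.0), (42, 6.25), (40, 6.0), (30, 2.5), (15, 1.0)]

def k_from_differences(n_diff, n_sites=SITES):
    """Expected substitutions per site, K = -ln(1 - D), with D = n_diff / n_sites."""
    return -math.log(1.0 - n_diff / n_sites)

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares regression of y on x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

years = [t * 1e6 for t, _ in DATA]             # x values, in years
ks = [k_from_differences(d) for _, d in DATA]  # y values
slope = least_squares_slope(years, ks)         # K accumulated per year of separation
rate = slope / 2                               # per lineage: two branches accumulate changes

print(f"slope = {slope:.3g} per year, rate = {rate:.3g} per site per year")
```

Forcing the line through the origin would be an equally defensible choice on biological grounds (no divergence at time zero) but would give a somewhat different slope; the intercept is kept here only to match the sums in the answer.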


Variation  across  Genes  in  the  Rate  of  the  Molecular  Clock 

If  an  organism  has  a particular  rate  of  mutation  in  its  genome,  one  might 
think  at  first  that  the  rate  at  which  the  molecular  clock  runs  would  be  the 
same  for  all  genes.  But  the  neutral  theory  predicts  that  the  rate  of  molecular 
evolution  should  depend  on  the  neutral  mutation  rate,  which  may  be  quite  a 
bit  lower  than  the  overall  mutation  rate,  and  may  vary  widely  across  genes. 
Figure 8.8 shows that three different proteins in the same organisms have
widely differing molecular clock rates. Nevertheless, within each gene, we
observe reasonably uniform rates of change. The variation across genes
appears to be due to the fact that some proteins are highly tolerant of
substitutions, whereas others suffer deleterious effects from even one or a few
minor changes. Genes whose function is well buffered from the environment
generally have a slower rate of substitution than genes whose products have
a premium on variability. The extremes are represented by histone H4, at the
low end, and γ-interferon, at the high end, with globin proteins near the
middle of the spectrum. In short, the molecular clocks for different genes
"tick" at different rates.

Figure 8.8 The molecular clock runs at different rates in different proteins.
One reason is that the neutral substitution rate differs among proteins.
Fibrinogen appears to be relatively unconstrained and has a high neutral
substitution rate, while cytochrome c has a lower neutral substitution rate, and
may be more constrained. Data are from a wide variety of organisms. (From
Dickerson 1971.)

In addition to functional constraints affecting substitution rate, the pattern
of hereditary transmission also affects substitution rate. Organelle
genomes are replicated and transmitted in a manner distinct from nuclear
genes, so it may not be surprising that they undergo substitutions with
different dynamics. Mitochondrial DNA exhibits wide variation in substitution
rates across its relatively tiny genome, but in animals the substitution rate is
generally much higher than the substitution rate of chromosomal genes. In
plants, on the other hand, comparisons among nucleotide substitution rates
of nuclear DNA, chloroplasts, and mitochondria reveal clear differences, with
mtDNA showing less than one-third the substitution rate of chloroplast
DNA, which in turn has about half the substitution rate of nuclear genes
(Wolfe et al. 1987). In general, genes on the X chromosome have a lower rate
of substitution than do genes on autosomes (Miyata et al. 1987). A higher rate
of mutation in males (Shimmin et al. 1993) would lower the X-chromosome
rate because the X chromosome spends more time in females. But the
substitution rate for Y-linked genes is indistinguishable from that of autosomes
(McVean and Hurst 1997), which suggests that the mutation rate is equal in
both sexes but lower in X-linked genes than in autosomal genes.

Not only do substitution rates vary from one gene to another, but they
also vary widely across sites within each gene! If all sites did undergo
substitution at the same rate, then the number of substitutions per site should
have a Poisson distribution. Fitch and Margoliash (1967) noticed that the
cytochrome c data did not fit this model unless invariant and hypervariable
sites were excluded. The models that we have developed so far assume that
all sites evolve in the same way, so to accommodate this variability (and to
test for how different the rates are) models that specifically incorporate rate
variation must be developed. One convenient model is to assume that the
rates vary according to a gamma distribution (Golding 1983; Wakeley 1993).
Yang (1996b) reviews estimates of the rate-variation parameter of the gamma
distribution, and finds that all 17 cases examined show significant
among-site variation in substitution rate.
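
The uniform-rate expectation mentioned above is easy to write down. As a rough sketch, the expected number of sites carrying exactly k substitutions under a single Poisson rate can be computed as follows. The site count and mean used in the usage lines are made-up illustrative numbers, not the cytochrome c data, and the function name is hypothetical.

```python
import math

def expected_sites(n_sites, mean_subs, k):
    """Expected number of sites with exactly k substitutions if every site
    evolves at the same rate (Poisson with mean `mean_subs` per site)."""
    return n_sites * math.exp(-mean_subs) * mean_subs ** k / math.factorial(k)

# Hypothetical example: 100 sites averaging 2 substitutions per site.
for k in range(6):
    print(k, round(expected_sites(100, 2.0, k), 1))
```

An observed excess of sites with zero substitutions and of sites with many substitutions, relative to these expectations, is the kind of departure that rate-variation models are meant to absorb.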

Variation  across  Lineages  in  Clock  Rate 

The neutral theory predicts that the molecular clock should run at
different rates for different organisms having different neutral mutation
rates. The range of mutation rates is impressive. Figure 8.9 shows the number
of nucleotide differences observed in the influenza NS genes, plotted
against the year of isolation of the virus containing them. The rate of gene
substitution averages λ = (1.94 ± 0.09) × 10⁻³ nucleotide substitutions per
nucleotide site per year. Although the rate of gene substitution is about
10⁶-fold faster than observed in germline genes in eukaryotes, it is
nevertheless approximately constant during the period available for study.

Figure 8.9 Molecular evolution in the NS genes of influenza virus determined
from strains isolated and stored during the past 60 years. The total rate of
evolution in the 890-nucleotide sequence averages 1.73 ± 0.08 nucleotide
substitutions per year, and the rate is remarkably uniform. (From Buonagurio
et al. 1986.)

The extraordinary rate of evolution in influenza virus is thought to be related
to a high rate of spontaneous mutation resulting from errors in replication
(Holland et al. 1982). As in many other RNA-based viruses, the RNA replicase
enzyme that replicates the influenza genome lacks a proofreading function. Rapid
rates of gene substitution can be of immense medical significance. Yokoyama
et al. (1988) estimated the rate of substitution in the pol gene of the human
immunodeficiency virus as 0.5 × 10⁻³ per nucleotide site per year. The time
of divergence between HIV1 and HIV2 was estimated at just 200 years ago,
and the bulk of the genetic variability among recently isolated strains of
HIV1 has been generated in the last 20 years.
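
The per-site influenza rate quoted in the text follows directly from the whole-sequence rate in the caption of Figure 8.9. A small Python check is sketched below; the eukaryotic comparison value is an assumed order of magnitude chosen only for illustration, not a figure given in this section.

```python
TOTAL_RATE = 1.73   # substitutions per year over the whole NS sequence (Figure 8.9)
SEQ_LENGTH = 890    # nucleotides in the NS sequence

# Rate per nucleotide site per year.
per_site_rate = TOTAL_RATE / SEQ_LENGTH
print(f"{per_site_rate:.2e} substitutions per site per year")

# Assumed order-of-magnitude rate for eukaryotic nuclear genes (illustrative only).
EUKARYOTE_RATE = 1e-9
fold_faster = per_site_rate / EUKARYOTE_RATE
print(f"roughly {fold_faster:.0e}-fold faster")
```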

The rate of the molecular clock also varies among taxonomic groups (Britten
1986). For example, the insulin gene evolved much more rapidly in the
evolutionary line leading to the guinea pig than in other evolutionary lines
(King and Jukes 1969), and the C-type viral sequences integrated into the
primate genome evolved at twice the rate in Asian primates as in African
primates (Benveniste 1985). Figure 8.10 illustrates another example of a
retardation in the clock in one lineage. Such departures from constancy of the
clock rate pose a problem in using molecular divergence to date the times of
existence of most recent common ancestors. Before this inference can be
justified, one needs to know that the set of species one is examining has a
uniform clock.

Figure 8.10 Gene genealogy of Drosophila Adh sequences showing a significant
slow-down of substitutions in the pseudoobscura clade. (After Takezaki et al.
1995.)


PROBLEM 8.5 The simplest way to test whether substitutions have
occurred at the same rate in different organisms is to consider a tree
like that in Figure 8.11. We expect that the divergence between A and
C should be the same as the divergence between B and C if the clock
is uniform on all branches. Tests of this hypothesis are known as
relative rate tests. Any site that underwent a substitution along the
branch from X to C (but not on the other branches) will have the
property that A = B ≠ C. Sites that underwent a substitution on the branch
from X to B (but not the other branches) will show A = C ≠ B. Tajima
(1993) showed that a simple and robust relative rate test could be
performed by simply doing a chi-square test of the null hypothesis that
the numbers of these two kinds of sites are equal. Suppose we observe
sequences as follows:

A  ATG CTA GCA TGC ATG CTA GC
B  ATC CTA GCA TCC ATG GTA GT
C  ATG CTA TCA TGC TTG GTA GC

Figure 8.11 A simple tree for illustrating the relative rate test of Tajima (1993).

Calculate the observed and expected numbers of sites in the two
categories (A = B ≠ C and A = C ≠ B), and calculate the chi-square statistic
to determine whether they are equal.


ANSWER The observed number of sites for which A = B ≠ C is 2, and
for A = C ≠ B there are 3 sites. Sites where A = B = C or A ≠ B ≠ C are
ignored in this test. The expected number of sites of each of the two types
is (2 + 3)/2 = 2.5, so the chi-square test gives (2 - 2.5)²/2.5 +
(3 - 2.5)²/2.5 = 0.2, which is clearly not significant. This example had
insufficient data for an adequate test, but it provides an example in
which there is no evidence for significant difference in rates. A more
flexible but more involved test, based on maximum likelihood, can be
found in Muse and Weir (1992).
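
The counting and the chi-square statistic in Problem 8.5 can be sketched in a few lines of Python. The function names are illustrative, and the sketch follows the two site categories exactly as the problem defines them; it is not meant as a full implementation of the published test.

```python
def relative_rate_counts(a, b, c):
    """Count sites with A = B != C and sites with A = C != B.
    All other site patterns are ignored, as in the problem."""
    ab_not_c = sum(1 for x, y, z in zip(a, b, c) if x == y != z)
    ac_not_b = sum(1 for x, y, z in zip(a, b, c) if x == z != y)
    return ab_not_c, ac_not_b

def chi_square_equal(m1, m2):
    """Chi-square statistic (1 degree of freedom) for the null hypothesis
    that the two counts have equal expectation."""
    expected = (m1 + m2) / 2
    return (m1 - expected) ** 2 / expected + (m2 - expected) ** 2 / expected

# Sequences from Problem 8.5, with the spaces removed.
A = "ATGCTAGCATGCATGCTAGC"
B = "ATCCTAGCATCCATGGTAGT"
C = "ATGCTATCATGCTTGGTAGC"

m1, m2 = relative_rate_counts(A, B, C)
chi2 = chi_square_equal(m1, m2)
print(m1, m2, round(chi2, 3))
```

With one degree of freedom, the statistic would need to exceed 3.84 to be significant at the 5 percent level.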


The  Generation-Time  Effect 

One observed feature of molecular evolutionary clocks is that their rate is
approximately constant on a time scale measured in years. This is quite
unexpected because mutation rates are thought to be more nearly constant when
measured in generations. However, the appropriate time scale of molecular
evolution is not completely settled (Easteal 1985), as there is some evidence
that the rate of synonymous substitution in genes in the rodent lineage (short
generation time) might be about two times as rapid as occurs in the same
genes in the human lineage (Wu and Li 1985; Li and Wu 1987). Evidence
from immunoglobulin genes further suggests that among mammals, the
primate lineage has the slowest rate of nucleotide substitution (Sakoyama et al.
1987).

Even if true, a nearly constant rate of gene substitution per year is not
necessarily in conflict with a constant rate of neutral mutation per generation.
The reason is that organisms with short generation times tend to be small
and to maintain large population sizes. In such organisms, the proportion of
nearly neutral mutations will be reduced because effective neutrality requires
that Ns ≪ 1, where s is the selection coefficient against the mutation.
However, the smaller proportion of nearly neutral mutations in these organisms is
offset against the occurrence of more mutations per unit time than in larger
organisms, because the generation time is shorter. Thus, the effects of short
generation time and larger population size act in opposite directions and
tend to cancel out (Crow 1985).

Does  the  Constancy  of  Substitution  Rates  Prove  the  Neutral  Theory? 

The possibility that gene substitutions might occur at an approximately
constant rate gave some credence to the simplest version of the neutral theory.
Theoretical principle 2, discussed earlier in this chapter, states that the
expected rate of substitution of neutral alleles equals the rate of mutation μ
to neutral alleles. Therefore, on the face of it, the occurrence of molecular
clocks would seem to support the neutral theory. But when we dig a bit
deeper into the predictions of the molecular clock, we find that things are not
necessarily so simple.

In a theoretically perfect molecular clock driven by a random process
identical to that of radioactive decay (a Poisson process), the variance in the
rate of ticking would be equal to the average rate of ticking. Tests based on
the number of substitutions between pairs of species in three proteins
showed that the variance was significantly larger than the mean (Ohta and
Kimura 1971). Langley and Fitch (1974) backed this up by an analysis in
which they estimated the number of substitutions on each branch of the
phylogenetic tree, and compared the mean and variance of these counts for each
branch. Again, there was a highly significant excess variance. Gillespie (1989)
examined the ratio R of the variance to the mean number of substitutions in a
set of four nuclear and five mitochondrial genes in mammals, and found that
R ranged from 0.16 to 35.55. (The value of 35.55 is for cytochrome oxidase II,
which shows 65 amino acid differences between human and mouse, 61
differences between human and cow, and only 21 differences between mouse
and cow.) Gillespie argued that the large range of R implies a sixfold
difference among mammalian lineages in rates of nucleotide substitution. This
excess variance in substitution rate has been called an "episodic clock,"
characterized by periods of stasis alternating with periods of rapid substitution.

Why does the clock appear to be episodic? One possible reason is that the
substitution process is not really a simple Poisson process. If instead the rate
itself changes in a random or stochastic manner, the data could be fitted
much better. Such a process, where the substitution rate for a Poisson process
is itself stochastic, is called a doubly stochastic process, and it does indeed
seem to fit the data better (Gillespie 1991). Such a compound Poisson process
ought to show clusters of rapid change separated by periods of relative
quiescence, a pattern that is generally supported by the data (Gingerich 1986;
Gillespie 1989, 1991). One means of causing variation in the substitution rate
is natural selection in a stochastically varying environment, and such models
can also fit the data satisfactorily (Gillespie 1986). Takahata (1987) has argued
that the variance can be inflated by a "fluctuating neutral space" model, in
which changes in selective constraints among lineages result in variation in
substitution rate among lineages. The dynamics of substitutions are
sufficiently complicated that a wide range of models can fit the data, but for now,
one thing we are sure of is that the simplest Poisson process is not adequate.
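
Why a randomly varying rate inflates the variance-to-mean ratio R can be seen from the law of total variance: if the number of substitutions is Poisson given the rate, the mean count equals the mean rate, and the variance equals the mean rate plus the variance of the rate, so R = 1 + Var(rate)/E(rate). The sketch below assumes, purely for illustration, that the rate follows a gamma distribution; the means and shape values are hypothetical and are not estimates from the studies cited.

```python
def dispersion_ratio(mean_rate, var_rate):
    """R = variance / mean of counts from a doubly stochastic Poisson process
    whose rate has the given mean and variance."""
    return (mean_rate + var_rate) / mean_rate

def gamma_rate_variance(mean_rate, shape):
    """Variance of a gamma-distributed rate with the given mean and shape."""
    return mean_rate ** 2 / shape

# Hypothetical example: an expected 10 substitutions per branch.
for shape in (0.5, 2, 10, 1000):
    r = dispersion_ratio(10, gamma_rate_variance(10, shape))
    print(f"shape = {shape}: R = {r:.2f}")
```

A constant rate (zero variance) recovers R = 1, the simple Poisson clock; the more the rate fluctuates (the smaller the shape), the larger R becomes.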


PATTERNS  OF  NUCLEOTIDE  AND  AMINO  ACID  SUBSTITUTION 

We have now seen several examples illustrating the general principle that
nucleotide substitutions occur at a greater rate than amino acid
replacements. The difference in rates, sometimes much greater than in these data,
results from redundancy in the genetic code. As illustrated in Table 8.2, the
codons for eight amino acids contain N (standing for any nucleotide) in their
third position, seven terminate in Y (any pyrimidine, which means T or C),
and five terminate in R (any purine, which means A or G). Coding sites
containing an N are called fourfold degenerate sites because any of the four
nucleotides will do, and those containing a Y or R are twofold degenerate
sites (Li et al. 1985). Because of degeneracies, nucleotides in a gene can
change without affecting the amino acid sequence. These changes are called
synonymous or silent nucleotide substitutions. Nucleotide substitutions that
do change amino acids are nonsynonymous substitutions.

Calculating  Synonymous  and  Nonsynonymous  Substitution  Rates 

TABLE 8.2  DEGENERACY IN THE GENETIC CODE

                  Second nucleotide in codon

     T            C            A            G

T    TTY Phe      TCN Ser      TAY Tyr      TGY Cys
     TTR Leu                   TAR Stop     TGA Stop
                                            TGG Trp

C    CTN Leu      CCN Pro      CAY His      CGN Arg
                               CAR Gln

A    ATH Ile      ACN Thr      AAY Asn      AGY Ser
     ATG Met                   AAR Lys      AGR Arg

G    GTN Val      GCN Ala      GAY Asp      GGN Gly
                               GAR Glu

Note: In this representation of the standard genetic code, the symbol N stands
for any nucleotide (T, C, A, or G), the symbol Y for any pyrimidine (T or C),
and the symbol R for any purine (A or G). The H in the set of codons for
isoleucine (Ile) stands for "not-G" (T, C, or A). Degeneracies are as follows:
N represents a fourfold degenerate site, Y and R represent twofold degenerate
sites. The H in the set of codons for isoleucine is considered as twofold
degenerate, as are the first nucleotides in four leucine codons (TTA, TTG, CTA,
and CTG) and four arginine codons (CGA, CGG, AGA, and AGG). All other
nucleotides are nondegenerate.

In calculations involving synonymous and nonsynonymous nucleotide sites,
the total number of synonymous sites is calculated as the number of fourfold
degenerate sites plus one-third of the number of twofold degenerate sites.
The total number of nonsynonymous sites in a coding region is defined as
the number of nondegenerate sites (nucleotides in which any change results
in an amino acid substitution), plus two-thirds of the number of twofold
degenerate sites (the latter because, with random mutation at twofold
degenerate sites, two-thirds of the mutations are expected to result in amino
acid changes). These conventions are illustrated above.
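
The degeneracy classes of Table 8.2 and the site-counting convention just described can be generated mechanically from the standard genetic code. The following Python sketch is one way to do it, assuming the rule implied by the table note: a position is fourfold degenerate if all three alternative nucleotides are synonymous, twofold if one or two are, and nondegenerate otherwise. The function names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def degeneracy(codon):
    """Return the degeneracy class (0, 2, or 4) of each position in a codon."""
    classes = []
    for pos in range(3):
        same = sum(
            1 for b in BASES
            if b != codon[pos]
            and CODE[codon[:pos] + b + codon[pos + 1:]] == CODE[codon]
        )
        # 3 synonymous alternatives -> fourfold; 1 or 2 -> twofold; 0 -> nondegenerate
        classes.append(4 if same == 3 else 2 if same else 0)
    return classes

def site_totals(codons):
    """Synonymous and nonsynonymous site totals under the
    one-third / two-thirds rule for twofold degenerate sites."""
    tally = Counter(d for codon in codons for d in degeneracy(codon))
    synonymous = tally[4] + tally[2] / 3
    nonsynonymous = tally[0] + 2 * tally[2] / 3
    return synonymous, nonsynonymous

print(degeneracy("CTG"), degeneracy("ATA"), degeneracy("GTC"))
```

Treating the third position of the isoleucine codons as twofold, even though two of the three alternatives are synonymous, follows the convention stated in the table note rather than any property of the code itself.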


PROBLEM 8.6 For the sequences of the region of the trpA gene
given earlier, calculate the synonymous and nonsynonymous
substitution rates. Start by using Table 8.2 to assign degeneracy classes to
each site. For each difference between E. coli and Salmonella, the
difference is synonymous either if the site is fourfold degenerate or if it
is twofold degenerate and the change is a transition (that is, A to G or
the reverse, or T to C or the reverse). The difference is
nonsynonymous either if the site is nondegenerate or if it is twofold degenerate
and the change is a transversion (that is, A or G to T or C, or the reverse).
Equation 8.15 is used to estimate the proportion of nonsynonymous nucleotide
substitutions per nonsynonymous site and the proportion of synonymous
substitutions per synonymous site. The degeneracy assignments are therefore as
follows:


Degeneracy:  004 004 004 002 002 002 002 004 004 002
             *     *   *
K12:         GTC GCA CCT ATC TTC ATC TGC CCG CCA AAT
             Val Ala Pro Ile Phe Ile Cys Pro Pro Asn
LT2:         ATC GCG CCG ATC TTC ATC TGC CCG CCA AAT
             Ile Ala Pro Ile Phe Ile Cys Pro Pro Asn
             N     S   S

Degeneracy:  004 002 002 002 204 204 004 002 002 004
               *           *   *             * *   *
K12:         GCC GAT GAC GAC CTG CTG CGC CAG ATA GCC
             Ala Asp Asp Asp Leu Leu Arg Gln Ile Ala
LT2:         GCG GAT GAC GAT CTT CTG CGC CAG GTC GCA
             Ala Asp Asp Asp Leu Leu Arg Gln Val Ala
               S           S   S             N N   S


ANSWER The stars above indicate differences with Salmonella and the letters
below indicate which changes are nonsynonymous (N) and which are synonymous
(S). Altogether there are 38 nondegenerate sites, 12 twofold degenerate sites,
and 10 fourfold degenerate sites. The total number of nonsynonymous sites is
38 + (2/3)12 = 46, and the total number of synonymous sites is 10 + (1/3)12 = 14.
There are three nonsynonymous changes (D = 3/46 = 0.065) and six synonymous
changes (D = 6/14 = 0.429). Now we use Equation 8.15 to estimate the proportion
of nonsynonymous nucleotide substitutions per nonsynonymous site and the
proportion of synonymous substitutions per synonymous site. The number of
nonsynonymous nucleotide substitutions per nonsynonymous site is k = 0.068,
and the number of synonymous nucleotide substitutions per synonymous site is
k = 0.635.
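
The arithmetic in this answer can be retraced as follows. Equation 8.15 is not reproduced in this section; the sketch assumes it has the Jukes-Cantor form k = -(3/4) ln(1 - (4/3)D), an assumption that reproduces both quoted values of k. Variable and function names are illustrative.

```python
import math

def jukes_cantor(d):
    """Substitutions per site from the proportion d of differing sites,
    assuming the Jukes-Cantor form k = -(3/4) ln(1 - (4/3) d)."""
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

# Site counts and changes taken from the answer to Problem 8.6.
nondegenerate, twofold, fourfold = 38, 12, 10
nonsyn_changes, syn_changes = 3, 6

nonsyn_sites = nondegenerate + 2 * twofold / 3   # 38 + (2/3)12 = 46
syn_sites = fourfold + twofold / 3               # 10 + (1/3)12 = 14

d_nonsyn = nonsyn_changes / nonsyn_sites
d_syn = syn_changes / syn_sites
k_nonsyn = jukes_cantor(d_nonsyn)
k_syn = jukes_cantor(d_syn)

print(f"D_N = {d_nonsyn:.3f}, k_N = {k_nonsyn:.3f}")
print(f"D_S = {d_syn:.3f}, k_S = {k_syn:.3f}")
```

The correction matters little at nonsynonymous sites, where D is small, but raises the synonymous estimate substantially because multiple substitutions at the same site become likely as D grows.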


Estimates of synonymous and nonsynonymous substitution rates for a
mammalian protein-encoding gene are plotted in Figure 8.12. A striking
observation is that the synonymous rates are generally much greater than
the rates of substitution at nonsynonymous sites. These rates are scaled, so
that if all mutations were equally likely to go to fixation, the rates would be
equal. The depression in nonsynonymous substitution rate is interpreted as
being caused by natural selection eliminating those changes that are
deleterious. There also appears to be greater variability of nonsynonymous rates
than there is in the synonymous rates, although even the latter vary by more
than twofold. Figure 8.13 shows that the two rates are correlated, suggesting
that either the mutation rates vary from gene to gene or that the constraints
on nonsynonymous sites are somehow correlated with those on
synonymous sites. We shall see how this correlation might arise at the end of this
section.

Figure 8.12 Synonymous sites and nonsynonymous sites in β-globin undergo
substitutions at different rates, but to a first approximation, both may appear to
exhibit a clocklike substitution process. (From Li et al. 1985a.)


Figure 8.13 Plotting the data of Figure 8.12 in another way, the relative rates
of synonymous and nonsynonymous substitutions vary somewhat, but in all
cases synonymous rates are higher. (Data from Li et al. 1985a.)

One problem that may be apparent with the above method for counting
synonymous and nonsynonymous sites is that the status of a particular site
may change during evolution. The reason is that changes elsewhere in the
codon may make a site that was formerly fourfold degenerate now become
twofold degenerate. In fact, the way the sites are tallied depends on the
order in which they are considered. Another way to calculate
nonsynonymous and synonymous substitution rates is to consider each codon and
count the number of changes that occurred. For codons that changed at a
single site, the change is scored as synonymous if there was no alteration in the
resulting amino acid sequence, and nonsynonymous if there was an
alteration. When there are two differences in a codon, then it is necessary to
consider both orders of occurrence, and if we have no reason to assume one
order is more likely, then both are considered equally likely. The two orders
may have differing numbers of synonymous and nonsynonymous changes.
For example, if a codon changes from CCG (proline) to AGG (arginine), it
could have done so either through CCG → ACG (threonine) → AGG or
through CCG → CGG (arginine) → AGG. The first possibility entails two
nonsynonymous changes, whereas the second entails only one. If there are three
changes in a codon, there are six possible orders in which they might have
occurred. This all-possibilities method, by Nei and Gojobori (1986), seems
like an improvement, but actually the estimates come out to be very similar
to the method in Problem 8.6. Furthermore, even this method does not avoid
the problems of sites changing status due to flanking changes. Far more
complicated models are needed to fully avoid this problem, but in the end, the
estimates that they give are also very similar to the simplest method outlined
in Problem 8.6 (Muse and Gaut 1994; Goldman and Yang 1994).
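
The enumeration of pathways between two codons can be sketched in Python as below. This is only an illustration of the counting idea described above, not a full implementation of the published method: it weights all orders equally and, as a simplifying assumption, does not exclude pathways that pass through stop codons. The function names are hypothetical.

```python
from itertools import permutations, product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def pathway_counts(start, end):
    """For each order in which the differing positions could have changed,
    return (synonymous steps, nonsynonymous steps) along that pathway."""
    differing = [i for i in range(3) if start[i] != end[i]]
    results = []
    for order in permutations(differing):
        codon, syn, non = start, 0, 0
        for pos in order:
            nxt = codon[:pos] + end[pos] + codon[pos + 1:]
            if CODE[nxt] == CODE[codon]:
                syn += 1
            else:
                non += 1
            codon = nxt
        results.append((syn, non))
    return results

def average_changes(start, end):
    """Average synonymous and nonsynonymous changes over all pathways."""
    paths = pathway_counts(start, end)
    syn = sum(s for s, _ in paths) / len(paths)
    non = sum(n for _, n in paths) / len(paths)
    return syn, non

# The example from the text: CCG (proline) to AGG (arginine).
print(pathway_counts("CCG", "AGG"))
print(average_changes("CCG", "AGG"))
```

For the CCG to AGG example, the pathway through ACG contributes two nonsynonymous steps and the pathway through CGG contributes one nonsynonymous and one synonymous step, so averaging over the two orders credits the codon with 1.5 nonsynonymous and 0.5 synonymous changes.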

Paralleling the evolutionary rates for amino-acid-changing substitutions,
the rates of nonsynonymous nucleotide substitution vary tremendously
among different proteins. Among the slowest rates is that of histone H4, for
which k = 0.004 × 10⁻⁹ substitutions per nonsynonymous nucleotide site per
year, and among the fastest is that of γ-interferon, for which k = 2.80 × 10⁻⁹
substitutions per nonsynonymous nucleotide site per year. The average rate
among a large number of proteins is very close to the rate found in
hemoglobin, which is 0.87 × 10⁻⁹ substitutions per nonsynonymous nucleotide site per
year (Figure 8.14). As in the examples given here, rates of nonsynonymous
nucleotide substitution are usually quite similar to the rates of amino acid
replacement in the same genes.

In contrast with the highly variable rates of nonsynonymous nucleotide
substitutions among proteins, the rates of synonymous substitution are much
more uniform. For example, in mammalian genes, the fastest rate of
synonymous substitution is only 3 to 4 times greater than the slowest rate (see Figure
8.14). However, the average rate, k = 4.7 × 10⁻⁹ substitutions per synonymous
site per year, is not only greater than the average rate of nonsynonymous
substitutions, but it is greater than the fastest known rate of nonsynonymous
substitutions (for γ-interferon).

Figure 8.14 Comparison of rates of synonymous and nonsynonymous
nucleotide substitutions. Synonymous rates are generally much faster and much
more uniform than nonsynonymous rates. (From Kimura 1986.)

The great variability among proteins in the rate of nonsynonymous
nucleotide substitution, when contrasted with the much smaller variability found
in the rate of synonymous substitutions, is illustrated graphically in Figure
8.14. This disparity has been used as evidence in favor of the neutral theory.
Interpreted according to the neutral theory, the variation in rates occurs
because there are selective constraints on amino acid substitutions that do not
operate as strongly on synonymous nucleotide substitutions. Not just any
amino acid will serve at a particular position in a protein molecule, because
each amino acid must participate in the chemical interactions that fold the
molecule into its three-dimensional shape and give the molecule its
specificity and ability to function. The need for proper chemical interactions and
folding constrains the acceptable amino acids that can occupy each site.
Although some amino acid replacements may be functionally equivalent or
nearly equivalent, many more are expected to impair protein function to such
an extent that they reduce the fitness of the organisms that contain them.
Thus, the constraints on acceptable amino acids are selective constraints
because unacceptable amino acid replacements are eliminated by selection.

If an amino acid replacement does occur, its effect on the function of the
protein product will depend on many factors, but one of the most important
determinants of protein conformation is the charge of the amino acid.
Different amino acid replacements give different numbers of charge changes, and
in most cases the smallest change in charge might be expected to result in the
smallest conformational change. Peetz et al. (1986) examined the charge
changes in the evolution of seven proteins, and found that hemoglobin α,
hemoglobin β, myoglobin, and insulin all accumulated charge changes at a
rate slower than expected by random substitution. This finding is consistent
with constraints on the conformation of these proteins that limit permissible
charge changes. On the other hand, cytochrome c and fibrinogens A and B
accumulate charge changes at the expected neutral rate.

For comparison of rates, it would be useful to study rates of nucleotide
substitution in stretches of DNA wholly devoid of function and therefore
subject exclusively to the whims of mutation and random drift. A likely
candidate is found in a class of genes called pseudogenes, which are DNA
sequences that are homologous to known genes but that have undergone one
or more mutations eliminating their ability to be expressed. Pseudogenes are
thought to be completely nonfunctional relics of mutational inactivation,
and, in fact, their extremely rapid rate of nucleotide substitution is offered in
support of this view. The average rate of nucleotide substitution in
pseudogenes is faster than the average rate found in intervening sequences, flanking
regions, and fourfold degenerate (synonymous) sites. Pseudogenes evolve at
the fastest rates known, which may correspond to rates of substitution when
DNA is completely unconstrained by natural selection. The fact that fourfold
degenerate sites evolve more slowly than pseudogenes may be a suggestion
that these sites are not totally lacking in constraint, an idea we shall return to
shortly.

Rates  of  nucleotide  substitution  also  vary  within  protein  molecules. 
Human  insulin  is  a good  illustration.  The  A and  B polypeptide  chains  found 
in  the  mature  insulin  molecule  are  created  by  post-translational  cleavage  of  a 
longer  polypeptide  known  as  preproinsulin.  Preproinsulin  contains  a signal 
peptide for secretion and an internal C-peptide, neither of which is present
in  the  active  molecule.  The  rates  of  nucleotide  substitution  in  these  three 
regions  are  0.16  for  the  A and  B chains,  0.99  for  the  C peptide,  and  1.16  for  the 
signal  peptide.  [As  in  Li  et  al.  (1985),  rates  are  expressed  in  terms  of  nonsyn- 
onymous  nucleotide  substitutions  per  nonsynonymous  site  per  billion 
years.]  In  insulin,  while  there  is  a sevenfold  difference  between  the  maximum 
and  minimum  rates  of  nonsynonymous  substitution  in  different  regions  of 
the  molecule,  the  rates  of  synonymous  substitution  differ  only  twofold. 
Moreover,  there  is  a negative  correlation  between  functional  importance  and 
rate  of  nonsynonymous  substitution  within  the  insulin  molecule.  Many 
diverse  amino  acid  sequences  can  serve  as  signal  peptides  provided  they  are 
hydrophobic,  which  suggests  that  selective  constraints  on  signal  peptides 
may  be  reduced  in  comparison  with  sequences  in  mature  polypeptides.  In 
insulin,  as  expected,  the  rate  of  nonsynonymous  substitution  is  fastest  in  the 
signal  peptide  and  slowest  in  the  functional  subunits  of  the  mature  molecule. 
This  kind  of  negative  correlation  between  selective  constraint  and  substitu- 
tion rate  has  also  been  observed  in  several  other  proteins  (Li  et  al.  1985). 

Within-Species  Polymorphism 

So  far  we  have  talked  only  about  differences  between  nucleotide  sequences  of 
genes from distinct species. DNA sequence differences between alternative
alleles of the same gene in a single species may also be synonymous or
nonsynonymous, and it is instructive to compare levels of within-species
polymorphism at synonymous vs. nonsynonymous sites. In this case we do not
generally talk about substitution rate, but rather quantify the variability with
the nucleotide diversity. Nucleotide diversity, often symbolized with the Greek
letter π, is the probability that a particular nucleotide site will differ between
copies drawn from two individuals. It is essentially the heterozygosity at the
nucleotide level.
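
The definition of nucleotide diversity can be made concrete with a small calculation. The sketch below averages the proportion of differing sites over all pairs of aligned sequences; the function name and the three toy sequences are hypothetical and are not data from any study cited here.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average proportion of sites that differ between two sequences.

    seqs: aligned sequences of equal length (hypothetical toy input).
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    # Count mismatched sites in every pairwise comparison.
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / (len(pairs) * length)

# Toy alignment: three alleles, eight sites (made-up values).
alleles = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]
pi = nucleotide_diversity(alleles)
```

With real data the estimate is usually reported with a sampling error, as in the Adh value quoted in the next paragraph; this sketch computes only the point estimate.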

Figure  8.15  illustrates  the  first  systematic  study  of  DNA  sequence  variation 
in  a set  of  11  alcohol  dehydrogenase  alleles  of  Drosophila  melanogaster  (Kreitman 
1983).  Of  the  2659  nucleotides  sequenced,  52  were  variable  across  the  11  alleles. 
The  nucleotide  diversity  over  the  entire  gene  was  0.0065  ± 0.0017,  meaning  that 
99.4%  of  the  time,  pairs  of  alleles  will  match  at  a site.  The  level  of  nucleotide 
diversity differs in different regions of genes. Figure 8.16 illustrates the
estimates of nucleotide diversity found in different parts of the Drosophila Adh
gene. The different parts are the 5' (upstream) flanking region, the 5'
transcribed but untranslated region, the coding region (nonsynonymous
substitutions only, with both the slowest and fastest rates shown), intervening
sequences, the 3' (downstream) transcribed but untranslated region, and the
3' untranscribed region. On the average, the fastest rates of substitution occur
in intervening sequences and the 3' flanking regions, but the average rates in
the 5' flanking regions and the 3' untranslated region are all substantially
faster than 0.88 × 10⁻⁹, which is the average rate of nonsynonymous
substitution in coding regions (see Figure 8.14). Neutralists would argue that
the high rates of substitution in noncoding regions and variation among
different parts of the coding region result from varying degrees of selective
constraints on different parts of the gene. It is to be emphasized that Figure
8.16 depicts the results for just one gene, and in individual instances,
especially in comparisons of closely related species, there may be fewer
substitutions observed in flanking sequences than in coding sequences, or
fewer changes in synonymous sites than in nonsynonymous sites.

Figure 8.15 Polymorphic nucleotide sites among 11 alleles of the Adh (alcohol
dehydrogenase) gene of D. melanogaster. The first line gives a consensus
sequence for Adh at sites that vary; subsequent lines give the nucleotides from
each copy for the polymorphic sites. A dot indicates that the site is identical to
the consensus sequence. The triangles indicate sites of insertion or deletion
relative to the consensus sequence. The star in exon 4 indicates the site of the
amino acid replacement (threonine-to-lysine) responsible for the Fast-Slow
mobility difference in the Adh protein. (After Kreitman 1983.)

Figure 8.16 Nucleotide diversity in Adh of Drosophila melanogaster.

Comparison  of  nucleotide  diversity  in  different  functional  regions  of  a 
single  gene  can  reveal  features  of  the  gene's  evolutionary  history.  For  exam- 
ple, in  11  sequenced  Adh  genes  of  Drosophila  melanogaster  (Kreitman  1983), 
among  14  substitutions  that  were  observed  in  the  coding  region,  13  were 
silent  substitutions.  Considering  the  genetic  code  and  the  codon  usage  in  the 
Adh  gene,  it  is  possible  to  calculate  what  portion  of  the  substitutions  would 
be  silent  if  all  substitutions  occurred  with  equal  frequency.  This  figure  is 
about  30%  in  the  case  of  the  Adh  gene  in  Drosophila,  which  implies  that  about 
70%  of  the  substitutions  would  be  expected  to  cause  amino  acid  replace- 
ments. Since  only  one  out  of  14  observed  substitutions  was  an  amino  acid 
replacement,  such  substitutions  are  greatly  underrepresented.  This  finding  is 
consistent  with  the  view  that  most  amino  acid  replacements  are  eliminated 
from the population by purifying selection. The same logic can be extended
to argue that sequences that are conserved are likely to be functionally
important; this type of reasoning led to the identification of a new open
reading frame in the HIV (AIDS) virus genome (Miller 1988).
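
How strongly underrepresented is one replacement among 14 substitutions? A rough sketch, assuming the 14 substitutions are independent and that each would be a replacement with probability 0.70 (the figure given above), is a one-sided binomial tail. The helper name `binom_cdf` is hypothetical, and the independence assumption is a simplification rather than part of Kreitman's analysis.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

p_replacement = 0.70        # expected share of replacements (from the text)
n_substitutions = 14        # substitutions observed in the coding region
observed_replacements = 1   # amino acid replacements observed

# Probability of seeing one or fewer replacements if all changes were equally likely.
p_tail = binom_cdf(observed_replacements, n_substitutions, p_replacement)
```

Under these assumptions the tail probability is on the order of one in a million, which is why the deficit of replacements is hard to attribute to chance.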

The  action  of  natural  selection  can  sometimes  be  inferred  from  levels  of 
synonymous  and  nonsynonymous  polymorphism.  For  genes  that  determine 
surface  antigens  of  pathogens  or  those  that  determine  the  major  histocom- 
patibility antigens  of  mammalian  cells,  the  rates  of  nucleotide  substitution 
can  be  quite  high.  One  way  to  address  whether  the  high  rate  of  substitution 
is  driven  by  selection  is  to  examine  the  levels  of  synonymous  and  nonsyn- 
onymous diversity  in  these  genes.  For  example,  Hughes  and  Nei  (1988) 
found  that  in  the  regions  coding  for  the  antigen  recognition  sites  in  the 
class  I MHC  (major  histocompatibility  complex)  genes  of  humans  and  mice, 
the  rate  of  nonsynonymous  substitution  exceeded  the  rate  of  synonymous 
substitution  by  a ratio  of  3 : 1.  This  ratio  is  the  reverse  of  that  found  in  the 
usual  situation  and  in  other  regions  in  the  same  genes,  where  silent  substitu- 
tions are  present  in  excess.  The  excess  of  amino  acid  replacements  is  consis- 
tent with  a model  in  which  mutations  that  generate  diversity  are  often 
advantageous,  and  hence  natural  selection  accelerates  the  substitution 
process.  Endo  et  al.  (1996)  developed  software  to  scan  the  gene  sequence 
databases  for  cases  in  which  the  nonsynonymous  rate  significantly  exceeded 
the  synonymous  rate,  and  they  recovered  17  cases.  Nine  of  these  17  cases 
were  cell  surface  antigens  or  immune  system  genes — proteins  for  which  one 
can  easily  imagine  scenarios  in  which  high  levels  of  diversity  are  advanta- 
geous. High  rates  of  nonsynonymous  substitution  are  also  found  in  protein 
toxins  called  colicins  that  certain  bacteria  produce  to  kill  potential  competi- 
tors in  their  immediate  vicinity  (Riley  1993;  Ayala  et  al.  1994). 

Implications  of  Codon  Bias 

Synonymous  substitutions  occur  at  a greater  rate  than  nonsynonymous  sub- 
stitutions, implying  that  they  face  weaker  selective  constraints.  But  are  syn- 
onymous changes  completely  neutral,  or  do  they  too  face  some  form  of  con- 
straint? One  potential  type  of  constraint  occurs  through  codon  preferences, 
which  are  correlated  with  the  relative  abundance  of  tRNA  molecules  that 
interact  with  and  translate  the  codons.  In  bacteria  and  yeast,  for  example, 
highly  abundant  proteins  tend  to  use  codons  for  abundant  tRNA  molecules, 
whereas  proteins  produced  in  small  amounts  tend  toward  codons  for  less 
abundant  tRNA  molecules  (Ikemura  1985).  A plot  of  the  frequency  of  use  of 
the  synonymous  codons  that  code  for  leucine  shows  that  CUG  is  much  more 
frequent  than  the  others,  corresponding  to  an  increased  abundance  of  this 
tRNA.  A second  potential  constraint  on  synonymous  substitutions  occurs 
through  possible  secondary  structures  that  the  RNA  might  form,  in  which 
certain  nucleotides  must  undergo  base  pairing  (see  the  next  section  for  an 
elaboration).  Pre-messenger  RNA  secondary  structure  may  influence  the 
speed or accuracy of intron splicing, rate of transport, or stability. A third
potential constraint on synonymous substitutions is related to the fact that,
during  translation,  the  probability  of  misincorporation  of  the  wrong  amino 
acid  increases  if  there  is  a pause  while  the  translation  machinery  waits  to 
find  a rare  tRNA.  Such  translation  errors  are  known  to  occur  (in  fact,  mis- 
translation of  an  mRNA  that  bears  a frameshift  mutation  can  yield  an  active 
protein).  Pausing  during  translation  may  also  be  of  importance  to  the  fold- 
ing of  the  protein  into  its  proper  three-dimensional  structure. 

If  synonymous  codons  are  neutral,  then  one  would  expect  their  frequen- 
cies of  use  to  correspond  to  the  product  of  the  nucleotide  frequencies.  If  all 
four  bases  were  equally  frequent,  all  synonymous  codons  should  be  used 
equally  frequently.  A more  subtle  way  to  test  for  departure  from  equal  codon 
use  is  to  count  the  incidence  of  polymorphisms  and  substitutions  toward  or 
away  from  the  most  abundant  codon.  If  the  most  abundant  codon  became 
the  most  abundant  by  chance,  then  the  substitutions  toward  and  away  from 
this codon should show no bias. But if the most abundant codon is
"preferred," then there will be a deficit of substitutions away from this codon.
Application of this kind of approach for codon bias in E. coli suggested an
average selection coefficient against disfavored codons of about s = 7.3 × 10⁻⁹
(Hartl et al. 1994). Even Drosophila, whose effective size might be around 10⁶,
appears to exhibit significant codon preference (Akashi 1995), suggesting
that selective constraints on synonymous codons must be greater than 10⁻⁶ in
this organism (Figure 8.17).
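
The neutral expectation described at the start of this paragraph, that synonymous codon frequencies follow the product of the nucleotide frequencies, can be sketched for the six leucine codons. The base frequencies below are hypothetical, the codons are written in DNA form (T in place of U), and the function name is illustrative only.

```python
# The six leucine codons, written as DNA (CTG corresponds to CUG in mRNA).
LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def expected_codon_usage(base_freq, codons):
    """Relative usage of synonymous codons if each position is drawn
    independently from the given base frequencies."""
    raw = {}
    for codon in codons:
        product = 1.0
        for base in codon:
            product *= base_freq[base]
        raw[codon] = product
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}

# Equal base frequencies: every synonymous codon is expected equally often.
uniform = expected_codon_usage(
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, LEU_CODONS)

# A hypothetical GC-rich genome shifts the neutral expectation toward CTC and CTG.
gc_rich = expected_codon_usage(
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}, LEU_CODONS)
```

Observed usage that departs strongly from such an expectation is the kind of pattern that motivates the tests for codon preference described above.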

From  the  selectionist  viewpoint,  while  granting  that  substitutions  in 
pseudogenes  may  be  neutral,  and  synonymous  substitutions  may  be  con- 
strained by  natural  selection  only  weakly,  it  is  nevertheless  maintained  that 
nucleotide  substitutions  that  change  amino  acid  sequences  are  inevitably 
subject  to  the  action  of  natural  selection  of  an  intensity  that  is  sufficient  to 
counteract  the  effects  of  random  genetic  drift.  Thus,  selectionists  would  argue 
that  amino  acid  substitutions  that  have  occurred  in  a protein  during  the 
course  of  evolution  became  fixed  by  natural  selection  because  they  increased 
the  fitness  of  the  carriers  through  improvement  in  function  of  the  molecule. 
However,  neutralists  argue  back,  the  selectionist  viewpoint  cannot  easily 
explain  the  negative  correlation  between  functional  importance  and  rate  of 
substitution  within  proteins.  Furthermore,  a neutralist  might  add,  even  a 
slightly  detrimental  mutation  has  some  chance  of  being  fixed  unless  a popu- 
lation is  very  large  (Chapter  7). 

POLYMORPHISM  AND  DIVERGENCE  IN 
NUCLEOTIDE  SEQUENCE  DATA 

The effects of varying the neutral mutation rate on levels of polymorphism
within a species and the interspecific divergence in nucleotide sequences are
plotted in Figure 8.18. The theory is consistent with the idea that genes with
a high rate of nucleotide substitution, as indicated by a large number of
interspecific sequence differences, should also have a high level of
intraspecific polymorphism. Polymorphism depends only on the product of the
neutral mutation rate and the effective size, through the formula
$H = \theta/(1 + \theta)$ that we encountered in Chapter 7. For strictly neutral
genes, interspecific divergence does not depend on the population size, but
instead follows the formula $k = 2\mu t$. If we compare two genes, the level of
intraspecific polymorphism would let us estimate a θ value for each gene.
Given the θ value for gene A and the observed interspecific divergence, the
divergence time could be estimated. For gene B, we would also have a θ
estimated from the level of polymorphism, and we could use the divergence
time estimated from gene A to determine a predicted value of divergence
in gene B.

Figure 8.17 The frequency of the six codons that encode leucine in
Drosophila melanogaster is not uniform. This kind of codon bias, in which one
codon is present in excess, is commonly observed. (Data from FlyBase,
http://cbbridges.harvard.edu:7081.)

Figure 8.18 Reasoning behind the HKA test. Consider two genes, A and B,
that differ in neutral substitution rate. θ can be estimated for each gene based on
observed levels of nucleotide heterozygosity (top panel). Given the observed
divergence between two species in gene A (determined by the neutral mutation
rate and time), the divergence in gene B can be predicted based on its neutral
substitution rate, and the divergence time obtained from gene A. The HKA test
is a goodness-of-fit test to the observed levels of intraspecific diversity and
interspecific divergence under a model whose parameters are population sizes,
neutral mutation rates, and times of divergence.
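
The two-gene reasoning can be written out as a short calculation. This is only a sketch of the logic, not the HKA test itself: it assumes θ = 4Nμ with the same N for both genes, so that N cancels and the predicted divergence in gene B is the divergence in gene A scaled by the ratio of the two θ values. The function names and all numerical inputs are hypothetical.

```python
def theta_from_heterozygosity(h):
    """Invert H = theta / (1 + theta) to get theta = H / (1 - H)."""
    if not 0 <= h < 1:
        raise ValueError("heterozygosity must lie in [0, 1)")
    return h / (1 - h)

def predicted_divergence_b(h_a, h_b, k_a):
    """Predict divergence in gene B from polymorphism in A and B and divergence in A.

    With theta = 4*N*mu and k = 2*mu*t (same N and t for both genes),
    k_B / k_A = mu_B / mu_A = theta_B / theta_A.
    """
    theta_a = theta_from_heterozygosity(h_a)
    theta_b = theta_from_heterozygosity(h_b)
    return k_a * theta_b / theta_a

# Hypothetical inputs: H_A = 0.2, H_B = 0.5, observed divergence in gene A = 0.05.
k_b_expected = predicted_divergence_b(0.2, 0.5, 0.05)
```

An observed divergence in gene B far from this prediction, relative to its sampling variance, is the kind of discrepancy that a formal goodness-of-fit test is designed to detect.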


The  above  reasoning  has  been  formalized  in  a popular  test  of  neutrality 
based  on  nucleotide  sequence  data  within  and  among  species  (Hudson  et  al. 
1987). Sequences of at least two genes from a number of individuals of each
of two species are needed to apply the test. Define $S_i^A$ and $S_i^B$ as the
number of polymorphic nucleotide sites in gene i in species A and B,
respectively, and $d_i$ as the number of differences in gene i between a pair of
alleles sampled randomly, one from species A and one from species B. The
expected values of these parameters are obtained from the infinite-sites
neutral model, assuming that the two species diverged t generations ago, that
the population sizes are 2N and 2Nf, and that each gene has an associated
$\theta_i = 4N\mu_i$. Estimates of $\theta_i$, f, and t are obtained by a
least-squares method that gives the best fit of the expressions for the expected
values and variances of $S_i^A$, $S_i^B$, and $d_i$ to the data, and
goodness-of-fit is tested with an appropriate chi-square test. Using data from
the Adh coding and 5' flanking regions in D. melanogaster and D. sechellia,
Hudson et al. (1987) found that the observed values deviated significantly
from the neutral model in a direction consistent with the operation of
balancing  selection  acting  on  the  coding  region  of  Adh.  This  finding  is  consis- 
tent with  Kreitman's  (1983)  observation  of  an  excess  of  silent  substitutions  in 
Adh,  except  that  the  test  of  Hudson  et  al.  makes  use  of  the  genetic  variation 
observed  within  and  among  species.  The  "HKA"  test  has  seen  many  appli- 
cations in  molecular  population  genetics  (Kreitman  and  Hudson  1991; 
Aguade  et  al.  1992;  Begun  and  Aquadro  1993;  Gaut  and  Clegg  1993). 


PROBLEM  8.7  In  a set  of  12  Adh  sequences  in  Drosophila  mel- 
anogaster, McDonald  and  Kreitman  (1991)  observed  42  silent  (synony- 
mous) polymorphisms  and  two  replacement  (nonsynonymous) 
polymorphisms.  As  had  been  concluded  by  Kreitman  (1983),  this 
suggests  that  most  replacement  mutations  are  deleterious  and  are 
eliminated  from  the  population.  When  they  examined  fixed  differ- 
ences between  melanogaster  and  either  D.  simulans  or  D.  yakuba,  they 
found  that  seven  of  the  fixed  differences  were  replacements  and  sev- 
enteen were  silent.  What  is  the  significance  of  this  observation? 


ANSWER  A null  hypothesis  might  be  that  the  effects  on  fitness  of  a 
mutation  would  be  the  same  whether  within  a species  or  at  any  time 
along  the  ancestral  history  of  two  species  back  to  the  common  ances- 
tor. If  this  is  true,  then  we  would  expect  the  ratio  of  silent  to  replace- 
ment polymorphisms  to  be  the  same  as  the  ratio  of  silent  to 
replacement  fixed  differences.  A simple  test  of  this  is  to  do  a 2 x 2 con- 
tingency chi-square: 

               Fixed    Polymorphic

Replacement      7           2

Silent          17          42

For this table we get χ² = 8.20, and with one degree of freedom,
P < 0.01.  (A  correction  is  often  applied  to  the  chi-square  for  tables  with 
counts  less  than  5,  but  it  does  not  make  much  difference  in  this  case.) 
The  low  probability  means  we  reject  the  null  hypothesis  and  conclude 
that,  within  species,  there  is  a tendency  to  avoid  replacement  poly- 
morphisms; however,  between-species  replacement  differences  are 
much  more  likely  to  occur.  McDonald  and  Kreitman  (1991)  argue  that 
this  pattern  is  consistent  with  adaptive  fixation  of  amino  acid  replace- 
ments, since  they  are  relatively  more  frequent  in  interspecific  compar- 
isons, and  such  adaptive  polymorphisms  would  be  less  common  than 
neutral  polymorphisms  because  adaptive  differences  would  not 
remain  polymorphic  for  as  long  a duration.  This  simple  test  is  useful 
in  assessing  the  relative  importance  of  neutral  drift  versus  selection  in 
interspecific  differences. 
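
The contingency test in this answer can be checked with a few lines of code. The sketch below uses the standard shortcut formula for a 2 x 2 chi-square without continuity correction and obtains the one-degree-of-freedom tail probability from the complementary error function; the function names are illustrative.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# Rows: replacement, silent. Columns: fixed, polymorphic (counts from Problem 8.7).
chi2 = chi_square_2x2(7, 2, 17, 42)
p = p_value_1df(chi2)
```

The same two functions can be applied to any table of fixed and polymorphic counts, although with very small counts an exact test may be preferable.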


Impact  of  Local  Recombination  Rates 

Recall  from  Chapter  5 that  the  level  of  polymorphism  in  Drosophila  shows  a 
striking  correlation  to  the  local  rate  of  recombination.  Regions  of  low  recom- 
bination rate  are  nearly  devoid  of  variation,  whereas  regions  with  high  rates 
of  recombination  are  highly  polymorphic.  The  idea  of  comparing  polymor- 
phism and  divergence  makes  this  pattern  even  more  striking  and  allows  us 
to  eliminate  a possible  cause.  One  possible  reason  for  the  correlation  is  that 
recombination  itself  is  mutagenic,  or  that  somehow  the  two  processes  are 
related  mechanistically.  (That  is,  perhaps  when  mutations  occur,  the  DNA 
configuration  is  altered  to  increase  the  chance  of  recombination.)  If  this  were 
the  case,  then  the  regions  of  low  recombination  rate  should  also  have  a low 
mutation  rate,  and  hence  lower  interspecific  divergence.  Figure  8.19  shows 
that  a lower  divergence  is  not  observed.  Levels  of  interspecific  divergence  are 
independent  of  local  recombination  rates.  The  conclusion  is  that  the  correla- 
tion between  recombination  rates  and  levels  of  polymorphism  observed  by 
Aquadro  et  al.  (1994)  must  be  due  to  more  rapid  elimination  of  the  variation 
in  regions  of  low  recombination. 

Two  known  mechanisms  that  remove  variation  faster  in  regions  of  low 
recombination  are  selective  sweeps  and  background  selection.  Background 
selection  is  thought  to  be  the  primary  mechanism  for  the  reduced  variation 
(discussed  in  Chapter  5 in  the  section  on  linkage  and  recombination),  but  this 
does  not  mean  that  sweeps  do  not  occur.  Selective  sweeps  occur  when  a 
favorable  mutation  takes  place,  and  selection  rapidly  increases  its  frequency. 
Such sweeps can have a dramatic effect on levels of variation in the selected
gene and the region around it. The size of the "swept" region depends on the
rate of recombination and is larger for regions of low recombination. This
means that the chance that a particular site has been swept free of variation is
greater in regions of low recombination, assuming the density of selective
sweeps is uniform across the genome. An example of a selective sweep is an
esterase B allele in the mosquito that is associated with pesticide resistance
(Figure 8.20). The resistant allele has apparently undergone a nearly global
sweep, judging from the near monomorphism of the esterase B gene
(Raymond et al. 1991; ffrench-Constant et al. 1991). We do not know how
frequent such sweeps are, but one possible means of identifying them is to
score many highly polymorphic markers in many populations and look for
regions of reduced variation. Schlötterer et al. (1997) performed such a survey
and found several cases of individual genes in single populations that were
depauperate in variation, perhaps due to a local sweep event.

Figure 8.19 The striking correlation between local rates of recombination and
levels of intraspecific nucleotide diversity cannot be explained by a lower
mutation rate in regions of low recombination. If regions of low recombination
had low rates of mutation, the interspecific divergence would be lower in these
regions. That it is not is shown by these data. (From Aquadro et al. 1994.)


GENE  GENEALOGIES 

There  is  an  important  distinction  between  the  construction  of  trees  from 
sequences  of  genes  from  different  species  and  from  sequences  of  alleles  from 
a single  species.  The  former  yields  a customary  phylogenetic  gene  tree, 
while  the  latter  produces  what  is  called  a gene  genealogy.  The  relationships 
among  species  result  from  macroevolutionary  processes,  whereas  allelic  dif- 
ferences result  from  a number  of  microevolutionary  processes,  including 
aspects of genetic transmission. Once the nucleotide sequences of alleles are
known, the different alleles can be treated like genes in different species in
applying standard methods for inferring a phylogenetic tree. However, great
care is needed in constructing gene genealogies, because recombination
among the sequences results in a gross violation of the assumptions of most
tree-building methods. Provided the rate of recombination is not too high,
localized blocks of sequence can be identified in which there appears to have
been no recombination in the ancestral history of the sampled alleles. With
this caveat, gene genealogies can be of great use in inferring the evolutionary
history of a polymorphism. For example, they can reveal which of a group of
alleles is older, or which alleles are more closely related to each other. Figure
8.21 shows the gene genealogy from Kreitman's (1983) Adh sequence data,
and the higher diversity of the S allele clearly makes it appear to be older.

Figure 8.20 Restriction maps of the esterase B gene from global samples of the
mosquito Culex pipiens. Note the identity of a haplotype from Egypt through Texas.
This haplotype is associated with insecticide resistance, and probably underwent a
global sweep in the face of strong selection. (From Raymond et al. 1991.)

Figure 8.21 A phylogenetic tree for 11 Adh alleles of Drosophila melanogaster
based on 43 nucleotide differences. The scale is the number of nucleotide
differences per site. Ja: Japan; Af: Africa; Wa: Seattle, Washington; Fl: Southern
Florida; Fr: France. S and F refer to the slow and fast electrophoretic forms.
(Data from Kreitman 1983.)

Hypothesis  Testing  Using  Trees 

Beyond  the  descriptive  approach  to  showing  relationships  among  alleles, 
gene  genealogies  can  be  used  to  test  fundamental  forces  of  population  genet- 
ics, including  natural  selection.  For  example,  consider  a phylogenetic  tree 
based  purely  on  neutral  variation.  As  illustrated  in  Figure  8.22A,  when  the 
substitution rate is μ, the expected time to coalescence to a common ancestor
for a randomly chosen pair of alleles is 4N generations (Chapter 2). Under a
model like Ohta's (1973), where many mutations are slightly deleterious, the
tree is not changed very much because the alleles included in a sample are
the subset of mutations that occurred that were nearly neutral. On the other
hand, in the case of adaptive mutations, the rate of fixation would be much
faster than with neutrality, so that sites of adaptive mutation would have
shorter coalescence times than flanking neutral sites (Figure 8.22B). Finally,
with balancing selection (heterozygote advantage), polymorphisms would
be maintained for a longer time than under the pure drift model (Figure
8.22C). The number of statistical methods for inference of population
genetic forces from gene genealogies is increasing rapidly, and there is ample
opportunity for exciting progress in this area.

Figure 8.22 Computer simulations of the infinite-allele model of molecular
evolution. (A) No selection. With strict neutrality, the expected time from
mutation to fixation of alleles that will go to fixation is $4N_e$ generations.
(B) Purifying selection (in this case with half of the mutations having a fitness
of 0.5) results in less polymorphism at any given time. (C) Stabilizing selection
(overdominance or frequency dependence) can retain alleles in a polymorphic
state for much longer times. Representative trees are plotted to the right of
each panel.


PROBLEM  8.8  A study  of  variation  in  the  gene  encoding  superox- 
ide dismutase  in  Drosophila  melanogaster  (Hudson  et  al.  1994b) 
revealed  63  polymorphic  sites  in  three  slow  alleles  and  22  fast  alleles 
(where  fast  and  slow  refer  to  the  mobility  of  the  protein  product  in  an 
electrophoretic gel). An additional 16 slow alleles were separately
scored,  giving  a total  of  19  slow  alleles  that  were  found  to  be  identical 
in  nucleotide  sequence.  The  fast  allele  broke  into  10  distinct  haplo- 
types,  and  the  most  common  was  FastA  with  nine  copies.  The  partial 
table  of  pairwise  counts  of  numbers  of  sites  that  differ  between 
alleles is:

         FastA   FastH   FastB   FastJ   FastK

Slow       1       3       4       9      16

FastA              2       3       8      15

FastH                      3      10      17

FastB                             11      18

FastJ                                      7

How would you address the question of whether this sample is typical of a
sample from a neutral gene?

ANSWER  The  aspect  of  the  pattern  of  variation  that  is  unusual  is 
that  the  fast  alleles  appear  to  be  quite  variable,  whereas  all  19  slow  alle- 
les are  identical.  A gene  genealogy  of  the  fast  alleles  would  look  like  a 
typical  neutral  tree  with  roughly  exponentially  distributed  branch 
lengths,  but  the  complete  tree  would  then  have  19  identical  slow  alle- 
les placed  one  substitution  away  from  FastA.  The  suspicion  is  that 
the slow allele must have arisen recently and is being pulled to high
frequency  by  selection.  An  observed  increasing  trend  in  the  slow  allele 
frequency  supported  this  conjecture.  To  make  a formal  test  out  of  this 
observation,  Hudson  et  al.  (1994b)  used  the  coalescent  procedure 
described  in  Chapter  7 to  generate  simulated  data  sets  with  a sample 
size  of  25  and  having  63  polymorphic  sites.  For  each  of  the  10,000  sim- 
ulated samples,  they  asked,  how  often  is  there  a set  of  12  alleles  that 
differ  by  0 or  1 substitutions?  (The  9 FastA  alleles  and  3 slow  alleles  in 
the  original  observed  sample  differ  at  just  1 site.)  The  answer  was  81 
of  the  10,000  cases,  giving  a probability  of  0.0081.  The  observed  sam- 
ple is  not  a likely  occurrence  under  neutrality. 
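
A simulation in the spirit of this test can be sketched as follows. This is not the procedure of Hudson et al. (1994b) itself: it assumes a standard neutral coalescent without recombination, places exactly 63 mutations on the branches in proportion to branch length, and interprets "a set of 12 alleles that differ by 0 or 1 substitutions" as a subset of 12 alleles containing at most one polymorphic site. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def simulate_haplotypes(n, s, rng):
    """One neutral coalescent sample: n alleles, exactly s polymorphic sites."""
    # Each active lineage is [set of sampled alleles below it, branch length so far].
    active = [[frozenset([i]), 0.0] for i in range(n)]
    branches = []
    while len(active) > 1:
        k = len(active)
        wait = rng.expovariate(k * (k - 1) / 2)  # time until the next coalescence
        for lineage in active:
            lineage[1] += wait
        i, j = rng.sample(range(k), 2)           # the two lineages that coalesce
        first, second = active[i], active[j]
        branches += [tuple(first), tuple(second)]
        active = [lin for idx, lin in enumerate(active) if idx not in (i, j)]
        active.append([first[0] | second[0], 0.0])
    # Drop s mutations on branches with probability proportional to branch length.
    weights = [length for _, length in branches]
    hits = rng.choices(branches, weights=weights, k=s)
    haps = [[0] * s for _ in range(n)]
    for site, (carriers, _) in enumerate(hits):
        for allele in carriers:
            haps[allele][site] = 1
    return [tuple(h) for h in haps]

def has_tight_cluster(haps, size):
    """True if some subset of `size` alleles contains at most one polymorphic site."""
    counts = Counter(haps)
    if max(counts.values()) >= size:
        return True
    for x, y in combinations(counts, 2):
        one_difference = sum(a != b for a, b in zip(x, y)) == 1
        if one_difference and counts[x] + counts[y] >= size:
            return True
    return False

def cluster_probability(n=25, s=63, size=12, reps=1000, seed=1):
    """Fraction of simulated neutral samples that contain a tight cluster."""
    rng = random.Random(seed)
    hits = sum(has_tight_cluster(simulate_haplotypes(n, s, rng), size)
               for _ in range(reps))
    return hits / reps
```

Calling `cluster_probability(reps=10000)` mirrors the number of replicates in the published analysis, but the estimate from this sketch need not match the reported 0.0081, because the original simulations may have differed in details such as recombination.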


It  is  instructive  to  note  that  these  data  were  consistent  with  neutrality  by 
the  Fu  and  Li  (1993)  test,  the  Tajima  (1989)  test,  and  the  HKA  test  (Hudson  et 
al.  1987),  demonstrating  that  even  strong  departures  from  neutrality  may  be 
missed  by  these  standard  tests.  This  problem  illustrates  a common  principle 
in  molecular  population  genetic  analysis,  which  is  that  ad  hoc  approaches  tai- 
lored to  particular  observations  often  are  necessary. 

The  topology  of  gene  trees  affords  an  opportunity  for  yet  another  test  of 
goodness  of  fit  of  data  to  the  neutral  theory.  We  saw  in  Chapter  7 that  the 
coalescent  approach  provides  a description  for  the  expected  topology  of  a 
gene  tree  under  the  infinite-sites  model.  In  particular,  the  expected  time 
back  to  the  next  preceding  coalescent  event  is  exponentially  distributed 
with parameter $1/\binom{k}{2}$, where k is the current number of distinct alleles. A
test of Fu and Li (1993) makes use of the fact that the model predicts a relationship
between θ and the number of "external mutations." An external
mutation is a mutation that occurs on a branch of the gene genealogy that
terminates in an observed allele (an external or terminal branch). The
remarkable observation that Fu and Li made was that the expected number
of external mutations is θ, independent of sample size. The test is based on the
idea  that  selection  will  affect  the  number  of  external  branches  more  than  it 
will  affect  internal  branches,  and  Fu  and  Li  devised  test  statistics  for  good- 
ness of  fit  between  observed  and  expected  numbers  of  external  mutations. 
The  test  has  some  advantages  over  the  Tajima  test,  but  Simonsen  et  al. 
(1995),  after  extensive  simulations  to  test  the  power  of  various  neutrality 
tests,  conclude  that  the  Tajima  test  (see  Problem  7.10)  is  generally  the  most 
powerful  against  alternative  hypotheses  of  selective  sweeps,  population 
bottlenecks,  or  population  subdivision  (but  see  Problem  8.8). 
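
The sample-size independence noted by Fu and Li can be checked numerically. The following is a minimal sketch, not the Fu and Li test itself. It assumes the standard neutral coalescent with time measured in units of 2N generations, so that k lineages wait an exponentially distributed time with rate k(k - 1)/2 before the next coalescence. Under that scaling, mutations fall on each lineage at rate θ/2 (with θ = 4Nμ), so an expected external branch length of 2 corresponds to an expected θ external mutations. The function names are illustrative only.

```python
import random

def external_branch_length(n, rng):
    """Simulate one neutral coalescent genealogy for n sampled alleles and
    return the total length of its external (terminal) branches, in units
    of 2N generations (an assumed time scaling)."""
    # True marks a lineage that is still an original sampled allele.
    lineages = [True] * n
    total_external = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2           # number of pairs that can coalesce
        wait = rng.expovariate(rate)     # time back to the next coalescence
        total_external += wait * sum(lineages)
        i, j = rng.sample(range(k), 2)   # the pair that coalesces
        for idx in sorted((i, j), reverse=True):
            lineages.pop(idx)
        lineages.append(False)           # the merged ancestor is internal
        k -= 1
    return total_external

def mean_external_length(n, reps=20000, seed=1):
    """Average external branch length over many simulated genealogies."""
    rng = random.Random(seed)
    return sum(external_branch_length(n, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    for n in (2, 5, 20):
        print(n, round(mean_external_length(n), 3))
```

For every sample size the simulated mean should lie close to 2, which is the branch-length counterpart of the statement that the expected number of external mutations is θ.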


Inferences  about  Migration  Based  on  Gene  Trees 

Data from a panmictic population obeying the infinite-sites model will have a
characteristic gene tree topology. If the population is divided into two
semi-isolated groups, the alleles within each group will, on average, be more
similar to one another than to alleles from the other group. This would mean that
a gene  tree  in  such  a subdivided  population  would  have  two  major  clades 
corresponding  to  the  two  populations.  For  higher  levels  of  migration,  the 
gene  tree  will  be  somewhere  between  these  two  extremes.  Slatkin  and 
Maddison  (1989, 1991)  devised  means  for  estimating  the  number  of  migrants 
per  generation,  Nm,  from  the  inferred  gene  genealogy.  In  essence,  the 
approach  uses  parsimony  to  obtain  a direct  count  of  migration  events  com- 
patible with  the  tree,  and  Nm  is  estimated  from  this  count. 
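
The counting step can be illustrated with Fitch parsimony, treating the population in which each allele was sampled as a character mapped onto the gene tree. This is only a sketch of the count of migration events; converting the count into an estimate of Nm requires the simulation-based calibration of Slatkin and Maddison, which is not reproduced here. The nested-tuple tree format and the function name are assumptions made for the example.

```python
def min_migration_events(tree, location):
    """Return the minimum number of changes of location (inferred migration
    events) required on a rooted binary gene tree, by Fitch parsimony.

    tree     -- nested 2-tuples of allele names, e.g. (("a1", "a2"), "b1")
    location -- dict mapping each allele name to its population label
    """
    def fitch(node):
        if not isinstance(node, tuple):       # a sampled allele (leaf)
            return {location[node]}, 0
        left_set, left_cost = fitch(node[0])
        right_set, right_cost = fitch(node[1])
        shared = left_set & right_set
        if shared:                            # no additional event needed
            return shared, left_cost + right_cost
        # Disjoint sets: one change of location must have occurred.
        return left_set | right_set, left_cost + right_cost + 1

    return fitch(tree)[1]

if __name__ == "__main__":
    where = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
    two_clades = ((("a1", "a2"), "a3"), (("b1", "b2"), "b3"))
    interleaved = (("a1", "b1"), ("a2", "b2"))
    print(min_migration_events(two_clades, where))
    print(min_migration_events(interleaved, where))
```

A tree with two clean geographic clades needs only one event, whereas a tree in which alleles from the two populations are interleaved needs more, which is the signal of higher migration.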

With  sufficient  DNA  sequence  data,  one  can  be  confident  that  identical 
alleles  are  truly  identical  by  descent,  an  important  aspect  of  inference  of  pop- 
ulation history  and  migration.  In  their  analysis  of  D.  pseudoobscura  Adh 
sequences,  Schaeffer  and  Miller  (1991)  found  that  geographically  distant  pop- 
ulations had  identical  alleles,  and  more  generally,  that  the  gene  tree  did  not 
partition  geographically,  as  though  the  population  were  panmictic.  This  was 
an  exciting  result,  given  the  extraordinary  level  of  population  subdivision  in 
D.  pseudoobscura  third  chromosome  inversions.  It  implies  that  the  latter  sub- 
division is  not  just  a historical  accident,  but  is  being  maintained  in  the  face  of 
sufficient  migration  to  homogenize  other  sorts  of  genetic  variation. 

The  data  of  Bowcock  et  al.  (1994)  show  yet  another  aspect  of  very  high 
resolution  molecular  data.  After  constructing  a tree  based  on  30  microsatellite 
loci,  they  observed  that  human  samples  showed  a significant  tendency  to 
cluster  by  continent.  Although  lower-resolution  methods  had  shown  some 
degree  of  dissimilarity  among  groups  of  humans,  this  was  the  first  study  to 
show that reduced intercontinental migration was sufficient to partition
human  genetic  variation. 

MITOCHONDRIAL  AND  CHLOROPLAST  DNA  EVOLUTION 

We  already  saw  in  Chapter  5 that  mitochondrial  DNA  can  be  highly  infor- 
mative about  the  geographic  structure  of  populations.  Some  of  the  advan- 
tages of  using  DNA  sequence  variation  from  this  organelle  genome  include: 

• The  DNA  molecule  (in  most  animals)  is  relatively  small  and  easy  to  isolate. 

• It  is  present  in  multiple  copies  per  cell;  therefore,  older  and  less  well  pre- 
served samples  are  still  likely  to  yield  useful  information. 

• The  mitochondrial  genome  does  not  undergo  recombination,  so  it  is  more 
likely  to  show  a clean  branching  structure  to  its  gene  trees. 

• It  evolves  rapidly. 

The  primary  problems  with  mtDNA  are: 

• The absence of recombination means that the gene tree constructed from
any mitochondrial DNA gene will reflect just a single realization of the
genealogical process. As such, the data will not be as informative about
species or population trees as, say, a dozen nuclear genes.

• Much of the work with mtDNA has been on the control region sequence.
While this region is highly variable, the variability occurs at a subset of
sites that are so mutable that multiple substitutions often occur.

In  animals,  mitochondria  are  usually  inherited  through  the  egg  cytoplasm 
(maternal  inheritance)  and  are  genetically  uniform  within  an  individual.  The 
mitochondrial  genome  consists  of  a single  circular  DNA  molecule,  denoted 
mtDNA,  the  size  of  which  varies  over  a remarkably  narrow  range  in  different 
species of vertebrates (15.7 to 19.5 kb), averaging about 16 kb. Human mtDNA
is  fairly  typical,  containing  a control  region  for  the  initiation  of  DNA  replica- 
tion, genes  for  two  ribosomal  RNA  molecules,  22  transfer  RNA  molecules, 
and  13  proteins.  Twelve  of  the  proteins  are  subunits  of  enzyme  complexes 
that  carry  out  electron  transport  and  ATP  synthesis.  The  genetic  code  of 
mammalian  mitochondria  differs  from  the  standard  code  in  that  ATA  codes 
for  Met,  TGA  codes  for  Trp,  and  AGR  codes  for  End  (termination  of  protein 
synthesis);  thus,  every  codon  in  the  mitochondrial  code  can  be  written  as 
either  NNY  or  NNR.  Animal  mitochondria  also  contain  several  hundred 
enzymes  used  in  metabolic  functions,  but  these  are  coded  for  by  nuclear 
genes,  and  the  enzymes  are  transported  into  the  mitochondria. 
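
The NNY/NNR regularity can be verified directly. The sketch below builds the standard code from its conventional 64-letter summary (codons enumerated with T, C, A, G at each position), applies the mammalian mitochondrial differences listed above, and checks that the two pyrimidine-ending and the two purine-ending codons of every family agree. The helper names are illustrative, and "*" is used here to stand for End.

```python
BASES = "TCAG"
STANDARD_AA = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Standard genetic code, keyed by DNA codon.
standard_code = {
    b1 + b2 + b3: STANDARD_AA[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Mammalian mitochondrial code: ATA = Met, TGA = Trp, AGR = End.
mito_code = dict(standard_code)
mito_code.update({"ATA": "M", "TGA": "W", "AGA": "*", "AGG": "*"})

def is_nny_nnr(code):
    """True if, for every pair of first two bases, the codons ending in a
    pyrimidine (T, C) agree and the codons ending in a purine (A, G) agree."""
    for b1 in BASES:
        for b2 in BASES:
            stem = b1 + b2
            if code[stem + "T"] != code[stem + "C"]:
                return False
            if code[stem + "A"] != code[stem + "G"]:
                return False
    return True

def translate(dna, code):
    """Translate codon by codon, stopping at the first termination codon."""
    protein = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        aa = code[dna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(is_nny_nnr(mito_code), is_nny_nnr(standard_code))
    print(translate("ATGATATGAAGA", mito_code))
    print(translate("ATGATATGAAGA", standard_code))
```

The standard code fails the check only because ATA/ATG and TGA/TGG have different meanings there; the same short sequence therefore translates differently under the two codes.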

At  the  nucleotide  level,  the  rates  of  substitution  in  mammalian  mtDNA 
are  typically  5 to  10  times  greater  than  occur  in  single-copy  nuclear  genes, 
averaging approximately $10 \times 10^{-9}$ substitutions per nucleotide site per year.
The  reason  for  the  high  rate  of  substitution  is  thought  to  be  either  a high  rate 
of  nucleotide  misincorporation  or  a low  efficiency  of  repair  of  the  DNA  poly- 
merase. Support  for  the  latter  view  comes  from  the  observation  that,  unlike 

Figure 8.23 Relationship between percent sequence divergence (100d) and
divergence time. The points represent estimates from pairwise comparisons of
restriction endonuclease cleavage maps. The initial rate of mtDNA sequence
divergence is shown by the longer dashed line and the rate of divergence of
single-copy nuclear DNA by the shorter dashed line. (From Brown et al. 1979.)

the  nuclear  DNA  polymerase,  the  mitochondrial  DNA  polymerase  lacks  the 
proofreading  function.  In  protein-coding  mitochondrial  genes,  the  rate  of 
synonymous  substitution  is  about  five  times  greater  than  the  rate  of  nonsyn- 
onymous  substitution,  which  is  comparable  with  the  ratio  found  in  nuclear 
genes.  Mitochondrial  tRNA  genes  in  mammals  evolve  approximately  100 
times  as  rapidly  as  their  nuclear  counterparts  (Brown  1985;  Avise  1986).  One 
result  of  this  faster  rate  of  nucleotide  substitution  is  that  the  divergence 
between  two  sequences  saturates  relatively  soon,  so  that  the  linearity  of 
divergence  over  time  (the  molecular  clock)  is  an  accurate  approximation  only 
for  species  that  have  diverged  less  than  about  10  million  years  (Figure  8.23). 
Exceptions  to  the  elevated  rate  of  mtDNA  divergence  have  been  found, 
notably  in  Drosophila  (Powell  et  al.  1986). 


PROBLEM 8.9 The mitochondrial DNA of 21 humans of diverse
geographic and racial origin was digested with 18 restriction
enzymes,  11  of  which  exhibited  one  or  more  fragments  in  which  size 
polymorphism  occurred  (Brown  1980).  All  restriction  site  polymor- 
phisms could  be  explained  by  single-nucleotide  differences,  thus 
there  was  no  evidence  for  insertions,  deletions,  or  other  mtDNA 
rearrangements.  Altogether,  868  nucleotide  sites  were  assayed  for  dif- 
ferences among  individuals,  and  the  average  number  of  differences 
per  nucleotide  site  per  individual  was  estimated  at  0.0018.  Assuming 
that mammalian mtDNA undergoes sequence divergence at the rate of 5 to
$10 \times 10^{-9}$ nucleotide substitutions per site per year, and that the rate is
uniform  in  time,  calculate  the  length  of  time  since  all  of  the  21  contem- 
porary mtDNA  molecules  last  shared  a common  ancestor.  Calculate  the 
effective  size  of  the  population  from  the  level  of  mtDNA  variability. 


ANSWER Given an average number of differences per nucleotide site
per individual of 0.0018 and an average rate of divergence of 5 to
$10 \times 10^{-9}$ per site per year, the time of the most recent common ancestor
would be between $0.0018/(10 \times 10^{-9})$ and $0.0018/(5 \times 10^{-9})$, or 180,000 to
360,000 years. Assuming a generation time of 20 years, this means that
all mtDNA in the diverse sample could have been from a single female
in the population between 9,000 and 18,000 generations ago. To estimate
the long-term effective size of the population, recall that the expected
time to fixation of a newly arisen neutral mutation is $4N_e$ generations.
This result applies to an autosomal gene in a diploid species. For mitochondrial
genes, only females transmit them, and they are effectively
haploid, so the corresponding fixation time for mtDNA is just $N_e$ generations.
If we argue that the one mtDNA type went to fixation in 9,000 to
18,000 generations, this is equivalent to saying that the long-term population
size has been $N_e$ = 9,000 to 18,000. This sounds like a low number,
but modern anthropologists find it reasonable, given the
population structure of ancient humans and the rapid, nearly starburst-like
growth since the adoption of agricultural methods.
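
The arithmetic of this answer is easily reproduced. The short sketch below uses only the values given in the problem (0.0018 differences per site, rates of 5 and 10 × 10^-9 per site per year, and a 20-year generation time); the function name is illustrative, and the final step simply restates the argument that the mtDNA fixation time is about N_e generations.

```python
def tmrca_years(diff_per_site, rate_per_site_per_year):
    """Time to the common ancestor implied by a divergence level and a rate."""
    return diff_per_site / rate_per_site_per_year

DIFF_PER_SITE = 0.0018     # average differences per nucleotide site
GENERATION_TIME = 20       # years per generation, as assumed in the answer

if __name__ == "__main__":
    for rate in (10e-9, 5e-9):
        years = tmrca_years(DIFF_PER_SITE, rate)
        generations = years / GENERATION_TIME
        # For mtDNA the fixation time is taken to be about N_e generations,
        # so the number of generations doubles as the estimate of N_e.
        print(f"rate {rate:.0e}: {years:,.0f} years, "
              f"{generations:,.0f} generations, N_e about {generations:,.0f}")
```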


One  of  the  most  dramatic  claims  in  the  history  of  population  genetics  was 
that  human  genetic  variation  in  mtDNA  indicates  a recent  African  origin  of 
modern  humans  (Cann  et  al.  1987).  This  claim  was  based  on  restriction  site 
variation  among  mtDNA  of  147  humans  in  five  populations.  The  12  restric- 
tion enzymes  sampled  an  average  of  370  restriction  sites  per  individual, 
equivalent  to  assaying  9%  of  the  mtDNA  genome  per  individual.  A total  of 
195  polymorphic  sites  were  found  in  the  genome,  and  the  precise  location  on 
the  mtDNA  sequence  of  all  polymorphic  sites  was  identified.  When  the  133 
distinct  mtDNA  haplotypes  were  assembled  into  a phylogenetic  tree,  a clade 
was  found  in  which  the  most  ancient  branch  pointed  to  a group  of  people  of 
African  ancestry  (Figure  8.24).  Given  the  observed  number  of  differences 
between  the  two  most  divergent  mtDNA  types,  and  assuming  there  is  2 to 
4%  divergence  in  mtDNA  sequences  per  million  years  (estimated  from  the 
human-chimp  split  at  5 MYA),  the  common  ancestor  to  all  of  the  observed 
haplotypes  was  estimated  to  have  existed  140,000  to  280,000  years  ago. 

Figure 8.24 Parsimony tree of mtDNA variation from the original "mitochondrial
Eve" paper. Much was made of the observation that there is an isolated
clade consisting only of Africans. (From Cann et al. 1987.)

Sequences of the control region, which diverge at a rate of 12 to 15% per million
years, produce a date for the common ancestor of 166,000 to 249,000
years  ago  (Vigilant  et  al.  1991).  Several  other  data  sets  have  been  collected  to 
address  the  issue  of  date,  and  all  have  produced  estimates  of  the  date  of  the 
common  ancestor  of  human  mtDNA  of  between  100,000  and  400,000  years 
ago  (Hasegawa  and  Horai  1991;  Pesole  et  al.  1992;  Ruvolo  et  al.  1993).  These 
figures, and their interpretation, have launched a controversy centering on:
(1) the best way to infer the time of the common ancestor, (2) the meaning of
higher African diversity, (3) the confidence in an African root, (4) the neutrality
of human mtDNA variation, and (5) the implications for human evolution.
Whether modern humans migrated out of Africa in the past 200,000
years may not be supported with statistical rigor by mtDNA alone (Templeton
1993), but when haplotypes of nuclear genes (Tishkoff et al. 1996) or
many nuclear genes are considered in addition, the case for African
origin is strong (Nei and Roychoudhury 1993).

We  must  be  careful  to  realize,  however,  that  the  fact  that  Africa  has  the 
greatest  genetic  diversity  today  does  not  by  itself  guarantee  that  modern 
humans  originated  in  Africa.  If  the  African  population  has  had  a long-term 
effective  size  much  larger  than  other  populations,  or  if  the  other  populations 
suffered  a bottleneck  that  Africa  did  not,  then  Africa  would  be  more  diverse 
no  matter  where  humans  originated  (Relethford  1995).  In  addition,  just 
because  a gene  genealogy  appears  to  have  a root  that  coincides  with  an 
African  allele  does  not  mean  that  modern  humans  came  from  an  expansion  of 
the  African  population  to  cover  the  earth.  It  only  means  that  the  one  gene  has 
the  observed  ancestral  history.  Other  genes  may  trace  back  to  other  origins. 

Inferences  about  human  origins  from  extant  patterns  of  genetic  variation 
require  an  understanding  of  nonequilibrium  models,  where  populations 
grow in size, new colonies are founded, and populations remain connected
by some level of migration. Recently there has been much attention paid
to  the  influence  of  past  changes  in  population  size  on  patterns  of  variation.  It 
was  observed  that  a growing  population  produces  a gene  genealogy  that  has 
a more  starlike  shape  than  does  a stationary  population,  and  this  in  turn  pro- 
duces a peak  in  the  distribution  of  pairwise  counts  of  mismatches  (Slatkin 
and  Hudson  1991;  Rogers  and  Harpending  1992).  The  use  of  patterns  of 
human  genetic  variation  to  make  inferences  about  our  ancestral  history  is  an 
active  and  lively  area  of  inquiry. 
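
The distribution of pairwise mismatches referred to above is simple to tabulate from aligned sequences. The following sketch only counts differences between all pairs; it makes no attempt to fit a model of population growth, and the function name and the toy sequences are illustrative.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(sequences):
    """For all pairs of aligned sequences, count how many pairs differ at
    exactly 0, 1, 2, ... sites. Returns a dict {differences: pair count}."""
    counts = Counter()
    for s1, s2 in combinations(sequences, 2):
        if len(s1) != len(s2):
            raise ValueError("sequences must be aligned to the same length")
        counts[sum(a != b for a, b in zip(s1, s2))] += 1
    return dict(sorted(counts.items()))

if __name__ == "__main__":
    sample = ["AAAA", "AAAT", "AATT", "AAAA"]
    print(mismatch_distribution(sample))
```

In a real data set, a single pronounced peak in this tabulation is the pattern that the cited studies associate with a starlike genealogy.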

Chloroplast  DNA  and  Organelle  Transmission  in  Plants 

Chloroplasts  are  cellular  organelles  that  also  have  their  own  genome  and 
also  are  transmitted  in  a non-Mendelian  fashion.  Chloroplast  DNA  (cpDNA) 
ranges  in  size  from  135  to  160  kb,  and  it  occurs  in  multiple  copies  in  each 
chloroplast.  Its  structural  organization  is  conserved  in  higher  plants,  and  the 
rate  of  synonymous  nucleotide  substitution  is  approximately  1 x 10“9  substi- 
tutions per  site  per  year.  Thus,  the  evolution  of  cpDNA  is  conservative  in 




TABLE 8.3 RATES OF SEQUENCE AND STRUCTURAL EVOLUTION IN ORGANELLE DNA

Genome               Rate of nucleotide substitution   Rate of structural evolution
Angiosperm cpDNA     Slow                              Slow
Angiosperm mtDNA     Slow                              Rapid
Mammalian mtDNA      Rapid                             Slow
Fungal mtDNA         Rapid                             Rapid

regard  to  both  sequence  and  structure  (Table  8.3).  The  opposite  extreme,  with 
a very  fast  rate  of  evolution,  is  found  in  the  mtDNA  of  fungi,  which  changes 
rapidly  in  both  sequence  and  structure. 

The mtDNA of angiosperm plants shows a pattern of evolution opposite to that
found in animal mtDNA. In sequence evolution the rate in angiosperms is
slow,  but  in  structural  evolution  it  is  fast.  In  plants,  the  mtDNA  genome  is 
large  and  highly  complex.  In  some  instances,  a single  molecule  can  resolve 
itself  into  smaller  circles  and  even  linear  molecules.  For  example,  in  the 
turnip (Brassica campestris), a 218 kb molecule undergoes an internal recombination
event that produces smaller circles of 135 kb and 83 kb. Maize
mtDNA  contains  six  pairs  of  repeated  sequences  that  can  undergo  recombi- 
nation and  create  a variety  of  structural  derivatives.  The  Arabidopsis  mtDNA 
genome  was  recently  sequenced,  and  although  it  is  366  kb,  nearly  all  the 
increase  in  size  compared  to  mammalian  mtDNA  is  noncoding  (Unseld  et  al. 
1996).  Many  plant  mitochondria  also  contain  autonomously  replicating  plas- 
mid DNA  molecules,  and  mtDNA  is  also  capable  of  incorporating  segments 
of  cpDNA.  Why  plant  mtDNA  genomes  are  so  large,  complex,  and  variable 
in  size  is  not  understood. 

Maintenance  of  Variation  in  Organelle  Genomes 

Organelle  genomes  have  unusual  population  genetics  because  of  their  (typ- 
ically) uniparental  transmission  and  because  many  copies  are  passed  from 
the  mother  to  the  progeny  through  the  egg.  Uniparental  transmission  has 
important  implications  in  the  operation  of  natural  selection,  since  it  is 
equivalent  to  a haploid  clonal  population  structure,  and  pure  selection  mod- 
els can  maintain  polymorphism  in  such  populations  only  if  the  fitnesses  are 
frequency  dependent.  From  the  outset,  then,  uniparental  transmission 
makes  it  less  likely  for  polymorphisms  to  be  maintained  by  natural  selec- 
tion, even  if  epistatic  effects  with  the  nuclear  genome  are  allowed  (Clark 
1984).  The  widespread  polymorphisms  observed  in  mtDNA  must  then  be 
attributed  largely  to  high  mutation  rates,  just  as  the  rapid  substitution  rate 
was attributed to a high mutation rate. Polymorphisms can also be maintained
by interspecific hybridization, and it is possible to obtain estimates
of rates and directions of interspecific matings from nuclear and mtDNA
data (Asmussen et al. 1987). Unusual forms of transmission, such as the
doubly uniparental transmission found in the mussel Mytilus edulis, result in
separate male and female lineages, which are highly divergent (Skibinski et al.
1994; Stewart et al. 1995).

The  theory  of  random  genetic  drift  for  organelles  is  more  complex  than 
that  for  nuclear  genes  because  individual  cells  have  many  organelles  that  are 
apportioned  among  daughter  cells;  thus  there  is  an  additional  level  of  sam- 
pling when  heteroplasmic  cells  divide.  Models  of  the  dual  sampling  process 
have  been  examined  in  some  detail  (Birky  et  al.  1983;  Takahata  1983, 1984). 
These  models  predict  some  level  of  heteroplasmy,  and  although  early  empir- 
ical studies  did  not  detect  heteroplasmy,  it  has  now  been  described  in  crick- 
ets (Harrison  et  al.  1987),  Drosophila  (Hale  and  Singh  1986;  Solignac  et  al. 
1983,  1984, 1987),  lizards  (Densmore  et  al.  1985),  mice  (Boursot  et  al.  1987), 
cattle  (Hauswirth  and  Laipis  1982),  frogs  (Monnerot  et  al.  1984),  treefrogs, 
and  bowfin  fish  (Bermingham  et  al.  1986).  Heteroplasmy  can  be  maintained 
by  a steady-state  balance  between  the  forces  of  random  genetic  drift  and 
mutation,  but  heteroplasmy  is  most  frequently  observed  in  restriction  length 
polymorphisms,  in  which  variants  differ  in  the  number  of  copies  of  a small 
repeat.  Simple  deterministic  models  show  that  heteroplasmy  can  be  stably 
maintained  by  infrequent  paternal  transmission  (leakage),  by  natural  selec- 
tion, or  by  bi-directional  mutation,  such  as  the  gain/loss  events  one  would 
expect  for  changes  in  copy  number  of  a small  repeat  (Clark  1988).  Distribu- 
tions of  heteroplasmy  in  the  field  cricket  are  consistent  with  a model  of 
mutation-selection  balance,  with  smaller  genomes  favored  by  selection  (Rand 
and  Harrison  1986). 

Evidence  for  Selection  in  mtDNA 

There  are  several  clear  examples  of  nonneutrality  of  mtDNA  mutations.  For 
example,  many  forms  of  cytoplasmic  male  sterility  are  caused  by  defects  in 
mtDNA  (Grun  1976;  Levings  1983).  Similarly,  cytoplasmically  transmitted 
drug  resistance  genes  have  been  shown  to  be  associated  with  the  mitochon- 
drial genome  of  yeast.  The  potential  importance  of  mtDNA  variation  in 
human  health  was  revealed  in  the  implication  of  mitochondrial  DNA  defects 
in  the  muscle  diseases  known  as  mitochondrial  myopathies.  The  celebrated 
bicycle racer Greg LeMond, a three-time winner of the Tour de France, was
forced  into  early  retirement  by  a defect  in  mitochondrial  oxidative  metabo- 
lism. Effects  of  natural  selection  also  have  left  their  mark  on  extant  patterns 
of  mtDNA  sequence  variation,  as  revealed  by  the  discordance  between 
levels  of  polymorphism  and  divergence  in  synonymous  vs.  nonsynonymous 
sites  (Ballard  and  Kreitman  1994;  Rand  and  Kann  1996).  The  strongly 
skewed distribution of frequencies of segregating sites also suggests that
human  mtDNA  has  faced  selection  pressure  (Hey  1997). 

If  a cytoplasmically  related  factor  of  any  sort  is  associated  with  a particular 
mtDNA  type,  then  the  mtDNA  will  "hitchhike"  along  with  the  other  cytoplas- 
mic factor.  A striking  example  of  this  mode  of  evolution  in  action  was  caught  by 
Turelli  et  al.  (1992),  when  they  noticed  that  a cytoplasmically  transmitted  Wol- 
bachia  infection  in  Drosophila  simulans  was  rapidly  spreading  north  in  California, 
and  as  it  did  so,  it  propelled  a single  mtDNA  type  to  high  frequency.  While  the 
mtDNA  genome  may  seem  small,  its  uniparental  transmission  makes  it  suscep- 
tible to  any  cytoplasmic  factor  that  may  carry  a particular  cytoplasmic  type  to 
fixation.  However,  most  populations  have  fairly  high  levels  of  mtDNA  variation, 
suggesting  that  such  sweep  events  are  not  very  common. 

MOLECULAR  PHYLOGENETICS 

The use of techniques of molecular biology, particularly those for determining
amino acid or nucleotide sequences, has added a new dimension to phylogenetic
inference. For example, the analysis of 16S ribosomal RNA sequences in a
broad  variety  of  microorganisms  has  led  to  a reclassification  at  the  deepest 
of  phylogenetic  levels,  resulting  in  a new  kingdom,  the  Archaea  (Woese 
1981).  In  addition  to  the  satisfaction  of  understanding  the  history  of  rela- 
tionships among  living  things,  the  application  of  comparative  molecular 
analysis  to  infer  robust  and  accurate  phylogenetic  relationships  has  spawned 
interest  in  the  application  of  those  phylogenetic  trees  for  testing  hypotheses 
about  evolutionary  mechanisms.  The  problem  of  inferring  the  correct 
branching  topology  for  a tree  that  relates  a set  of  organisms  is  a challenge  in 
part  because  of  the  enormous  number  of  possible  bifurcating  trees.  If  there 
are n species to be placed, there are $(2n - 3)!/[2^{n-2}(n - 2)!]$ rooted trees that
describe  possible  ancestral  histories.  For  five  species  this  number  is  105,  and 
for  10  species  it  is  34,459,425.  For  many  data  sets  of  30  or  more  species,  the 
number  of  possible  trees  is  so  enormous  that  it  is  not  possible  to  examine  all 
topologies  and  assess  the  fit  of  the  data  to  each  tree,  even  with  the  very 
fastest  computers.  Fortunately,  the  trees  are  not  all  independent  of  one 
another,  and  the  key  to  many  of  the  algorithms  that  try  to  find  the  best  fit- 
ting tree  is  to  eliminate  whole  classes  of  trees  based  on  the  observed  data.  Let 
us  consider  a few  of  these  tree-building  methods. 
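
The count of rooted bifurcating trees, (2n - 3)!/[2^(n-2) (n - 2)!], can be evaluated exactly with integer arithmetic, which makes the rate of growth easy to appreciate. The function name below is illustrative.

```python
from math import factorial

def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n species:
    (2n - 3)! / (2**(n - 2) * (n - 2)!)."""
    if n < 2:
        raise ValueError("at least two species are needed")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

if __name__ == "__main__":
    for n in (3, 5, 10, 30):
        print(n, num_rooted_trees(n))
```

The values for 5 and 10 species reproduce the 105 and 34,459,425 quoted above; printing the value for 30 species shows why exhaustive evaluation of every topology is impractical.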

Algorithms  for  Phylogenetic  Tree  Reconstruction 

If  a gene  in  a pair  of  species  or  populations  evolves  in  clocklike  fashion,  and 
if  the  degree  of  divergence  between  two  genes  implies  that  they  have  been 
diverging  for  t generations,  then  we  can  infer  that  the  genes  separated  from 
a common ancestor t/2 generations ago. This reasoning provides a group of
methods  of  tree  construction  based  on  measures  of  genetic  distance.  One 
such  method  is  the  unweighted  pair-group  method  with  arithmetic  mean 
(UPGMA) or average distance method. This method requires that all
sequences  evolve  at  the  same  rate,  an  assumption  that  other  methods  can 
relax  to  some  degree,  but  the  ease  of  understanding  UPGMA  still  gives  it 
heuristic  appeal.  With  a matrix  of  all  pairwise  distances,  a tree  is  built  up  by 
first  grouping  the  two  species  with  the  smallest  distance.  A new  distance 
matrix  is  then  constructed,  with  the  grouped  species  now  considered  as  one 
unit. If the grouped species were indexed i and j, then for all k ≠ i, j the distance
from k to the group {i, j} is $d_{k(ij)} = \tfrac{1}{2}(d_{ik} + d_{jk})$. In words, the distance from
each other species k to the group {i, j} is the average of the distances from
species  k to  each  of  species  i and  j in  the  group.  The  new  distance  matrix  is 
again  searched  for  the  smallest  element,  and  the  appropriate  grouping  again 
occurs.  This  process  is  repeated  until  all  species  are  clustered  into  a tree. 
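
The clustering procedure just described can be sketched in a few lines. One detail is an assumption beyond the text: when a group already contains several species, the sketch averages over all of its members (a size-weighted average), which is the usual convention for UPGMA and reduces to ½(d_ik + d_jk) when i and j are single species. The data layout (distances keyed by a frozenset of two labels) and the returned (left, right, height) tuples are illustrative choices.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by repeatedly joining the closest pair of groups.

    names -- list of taxon labels
    dist  -- dict mapping frozenset({a, b}) to the distance between a and b
    Returns nested tuples (left, right, height), where height is half the
    distance at which the two groups were joined.
    """
    # Each active group: label -> (subtree, number of species it contains).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Find the pair of active groups with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        joined = (a, b)                      # label for the new group
        for k in clusters:
            # Average distance from group k to all members of the new group.
            d[frozenset((joined, k))] = (
                size_a * d[frozenset((a, k))] + size_b * d[frozenset((b, k))]
            ) / (size_a + size_b)
        height = d[frozenset((a, b))] / 2
        clusters[joined] = ((tree_a, tree_b, height), size_a + size_b)
    return next(iter(clusters.values()))[0]

if __name__ == "__main__":
    pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
             ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
    dist = {frozenset(p): v for p, v in pairs.items()}
    print(upgma(["A", "B", "C", "D"], dist))
```

With these clocklike distances, A and B are joined first, then C, then D, and the node heights are half the corresponding distances.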

Tree-building methods not only produce a tree topology, but they
generally  also  give  estimates  of  branch  lengths  of  the  tree.  An  example  of  one 
method  for  branch-length  estimation  is  the  method  of  Fitch  and  Margoliash 
(1967). Suppose the number of substitutions distinguishing sequences i and j
is $d_{ij}$. If the tree relating sequences 1, 2, and 3 has branch lengths A, B, and C
(Figure 8.25), then the branch lengths can be estimated from

$$A = \tfrac{1}{2}(d_{12} + d_{13} - d_{23})$$

$$B = \tfrac{1}{2}(d_{12} + d_{23} - d_{13}) \qquad (8.17)$$

$$C = \tfrac{1}{2}(d_{13} + d_{23} - d_{12})$$

These relations were found by solving the equations $d_{12} = A + B$, $d_{13} = A + C$,
and $d_{23} = B + C$. With more than three sequences, the tree is built up by considering
three units at a time, beginning with the two most closely related
sequences and grouping the remaining sequences. If sequences 1 and 2 are
the most similar, then the distance from sequence 1 to the remaining group
is the average of the distances from sequence 1 to each member of the group.


Figure 8.25 A simple phylogenetic tree relating Species 1, Species 2, and
Species 3. A, B, and C represent branch lengths from the most recent common
ancestor.

In this way, only three distances are considered at a time, and Equations 8.17
allow  branch  lengths  to  be  estimated.  This  method  is  known  as  least  squares, 
and  it  turns  out  that  Equations  8.17  minimize  the  sum  of  squared  deviations 
from  the  model,  much  like  linear  regression. 
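
For three sequences the estimates of Equations 8.17 can be computed directly. The sketch below assumes only the three relations d12 = A + B, d13 = A + C, and d23 = B + C; the function name is illustrative.

```python
def three_point_branch_lengths(d12, d13, d23):
    """Branch lengths A, B, C leading to sequences 1, 2, 3 from their common
    node, obtained by solving d12 = A + B, d13 = A + C, d23 = B + C."""
    a = 0.5 * (d12 + d13 - d23)
    b = 0.5 * (d12 + d23 - d13)
    c = 0.5 * (d13 + d23 - d12)
    return a, b, c

if __name__ == "__main__":
    # Distances generated from branch lengths A = 3, B = 5, C = 10.
    print(three_point_branch_lengths(8, 13, 15))
```

When the distances are exactly additive, as in this example, the original branch lengths are recovered; with real data a negative estimate can occur, which is one sign that the distances do not fit the tree well.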

Another  algorithm  for  tree  construction  is  particularly  well  suited  to  the 
situation  in  which  one  does  not  know  whether  rates  of  substitution  are  con- 
stant across  clades  of  the  tree.  This  method  is  known  as  neighbor-joining, 
because it groups species that have the property of being "neighbors" (Saitou and Nei
1987).  Begin  by  assuming  that  the  sequences  are  all  related  to  one  another  by 
a star  phylogeny  (Figure  8.26).  For  a star  phylogeny  with  N sequences,  the 
sum  of  the  branch  lengths  is 

$$S_0 = \sum_{i<j} d_{ij}/(N - 1)$$

(It  may  help  to  draw  a star  phylogeny  to  see  that  each  branch  gets  counted 
N - 1 times.)  Next  we  begin  a procedure  that  groups  certain  sequences 
together.  For  each  possible  pair  of  sequences,  a tree  like  that  in  step  1 in 
Figure  8.26  is  constructed.  Branch  lengths  for  this  tree  are  estimated  by  least 
squares, and the sum of the branch lengths for the entire tree ($S_{ij}$) is calculated.
We consider as neighbors that pair of sequences i and j that give the minimum
of the $S_{ij}$'s. After the first pair of neighbors is found, that pair is considered
as a single entity (joined neighbors), and the process of considering
all possible pairings is repeated. The distance from any one sequence k
to this pair of neighbors (i and j) is the average of the two distances, or
$\tfrac{1}{2}(d_{ik} + d_{jk})$. The process ends when there are just three neighbors left, and at
this  point  we  have  a finished  neighbor-joining  tree  complete  with  branch 
lengths.  The  criterion  for  neighbor-joining  is  to  minimize  the  sum  of  branch 
lengths,  and  sometimes  it  is  possible  to  find  tree  topologies  that  are  even  short- 
er, using  a method  called  minimum  evolution  trees  (Rzhetsky  and  Nei  1992). 
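
The first round of neighbor-joining can be sketched as follows. The expression used for the least-squares tree length when i and j are paired is taken from Saitou and Nei (1987) rather than from the text above: S_ij = Σ_k (d_ik + d_jk)/[2(N - 2)] + d_ij/2 + Σ_{k<l} d_kl/(N - 2), where k and l range over the sequences other than i and j. The function names and the frozenset-keyed distance table are illustrative, and only the choice of the first pair of neighbors is shown, not the full algorithm.

```python
from itertools import combinations

def star_length(d, taxa):
    """Sum of branch lengths of the star phylogeny: S0 = sum(d_ij) / (N - 1)."""
    return sum(d[frozenset(p)] for p in combinations(taxa, 2)) / (len(taxa) - 1)

def pair_tree_length(d, taxa, i, j):
    """Least-squares sum of branch lengths S_ij for the tree in which i and j
    are paired and all other sequences remain in a star (needs N >= 3)."""
    n = len(taxa)
    others = [t for t in taxa if t not in (i, j)]
    to_pair = sum(d[frozenset((i, k))] + d[frozenset((j, k))] for k in others)
    among_others = sum(d[frozenset(p)] for p in combinations(others, 2))
    return (to_pair / (2 * (n - 2)) + d[frozenset((i, j))] / 2
            + among_others / (n - 2))

def closest_neighbors(d, taxa):
    """Pair of sequences whose joining gives the smallest S_ij."""
    return min(combinations(taxa, 2),
               key=lambda p: pair_tree_length(d, taxa, *p))

if __name__ == "__main__":
    # Additive distances from the tree ((A:1, B:2):5, (C:3, D:4)).
    pairs = {("A", "B"): 3, ("C", "D"): 7, ("A", "C"): 9,
             ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 11}
    d = {frozenset(p): v for p, v in pairs.items()}
    taxa = ["A", "B", "C", "D"]
    print(star_length(d, taxa))
    print(pair_tree_length(d, taxa, "A", "B"), pair_tree_length(d, taxa, "A", "C"))
    print(closest_neighbors(d, taxa))
```

In this example, pairing the true neighbors A and B gives a tree length equal to the sum of the five branch lengths used to generate the distances, shorter than both the star phylogeny and the incorrect pairing of A with C.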


PROBLEM 8.10 Consider a sample of one allele drawn from each of
three  species.  Suppose  that  the  tree  that  one  gets  from  these  alleles 
may  be  represented  ((A,B),C),  implying  that  A and  B are  most  closely 
related,  and  C is  the  outgroup.  What  are  the  possible  relationships 
among  the  species  bearing  these  alleles? 


ANSWER  This  problem  bears  on  an  important  issue  in  phylogeny 
reconstruction,  namely  that  any  one  gene  tree  does  not  necessarily 
reflect  the  true  pattern  of  splitting  of  species.  The  easiest  way  to  see 

Figure 8.26 Illustration of the neighbor-joining method for phylogeny reconstruction.
Given a distance matrix, one starts with a star phylogeny and tests all
trees  having  different  pairs  separated  from  the  rest.  The  tree  with  A-B  joined  is 
the  shortest  such  tree.  The  process  of  testing  all  pairs  of  "neighbors",  where  a 
neighbor  may  be  either  a single  allele  or  a cluster  of  alleles,  is  repeated  until  no 
more  joining  can  be  done.  (See  Saitou  and  Nei  1987.) 


this  is  to  consider  ancestral  populations  as  being  polymorphic,  in 
which  case  the  speciation  process  may  sort  out  the  alleles  in  various 
ways.  It  turns  out  that  the  possible  species  trees  include  ((A,B),C), 
((A,C),B),  and  ((B,C),A).  In  other  words,  the  gene  tree  does  not  elimi- 
nate the  possibility  of  any  of  the  species  trees. 
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Distance  Methods  versus  Parsimony 

There  is  no  universal  theory  that  provides  a single  optimal  way  to  construct 
phylogenetic  trees,  and  as  basic  as  the  distance  matrix  seems,  it  is  not 
required  by  all  methods.  Another  method,  known  as  maximum  parsimony, 
uses  the  smallest  number  of  mutational  events  necessary  to  account  for  the 
evolution  of  a set  of  sequences  from  a common  ancestor  to  construct  the 
trees.  There  are  a number  of  such  parsimony  methods  based  on  trees  with 
the  smallest  number  of  substitutions,  but  none  guarantee  that  the  most  par- 
simonious tree  is  the  correct  tree.  For  example,  when  rates  of  substitution  dif- 
fer in  different  branches  of  the  tree,  the  parsimony  method  often  fails  to  give 
the  correct  topology  (Felsenstein  1978).  Methods  for  constructing  phyloge- 
netic trees  have  been  reviewed  by  Felsenstein  (1981,  1982)  and  more  recent- 
ly by  Nei  (1996).  Massive  simulation  studies  have  been  done  to  test  the  sta- 
tistical reliability  of  tree-constructing  methods  (Rohlf  and  Wooten  1988; 
Sourdis  and  Nei  1988;  Hillis  1996).  Results  of  these  simulations  are  easy  to 
summarize:  if  the  data  allow  one  method  to  assign  a topology  with  good  sta- 
tistical confidence,  generally  all  the  popular  methods  work  pretty  well.  But 
if  the  data  have  many  apparent  reverse  mutations,  variable  rates  among 
branches,  or  wide  variation  in  rates  across  sites,  then  none  of  the  methods 
works  very  well. 

Bootstrapping  and  Statistical  Confidence  in  a Tree 

Because  there  are  so  many  possible  tree  topologies,  it  is  important  to  assess 
how  much  statistical  confidence  one  can  place  in  a particular  tree.  One  can- 
not assign  a numerical  standard  error  to  a tree;  by  its  geometrical  nature  a 
tree  is  actually  a complicated  statement  of  phylogenetic  relationships,  such 
that  we  might  have  high  confidence  in  some  branches,  and  low  confidence 
in  others.  A widely  used  method  of  assessing  confidence  in  the  nodes  of  a 
tree is the bootstrap test (Felsenstein 1985). The basic idea is quite simple: the
sites of the original data are resampled with replacement to form a new data set
of the same size and, from this new data set, a tree is constructed. For each node
in the original tree, we ask whether
the  new  tree  has  the  same  cluster  of  sequences.  The  whole  operation  of 
resampling  the  data,  drawing  a tree,  and  tallying  up  nodes  that  are  in  the 
original  tree  is  repeated  perhaps  1000  times,  and  the  final  result  is  displayed 
graphically  as  a number  next  to  each  node  indicating  the  percentage  of  time 
that  cluster  is  present  among  the  resampled  trees.  If  that  fraction  is  high, 
then  one  gains  confidence  that  the  given  cluster  actually  belongs  together. 
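
The resampling loop can be sketched in Python. This is a minimal sketch under loud assumptions: building a full tree is replaced by a stand-in, namely finding the single closest pair of sequences by number of differences, and the names `closest_pair`, `bootstrap_support`, and the example alignment are hypothetical. A real analysis would rebuild the whole tree for each replicate and tally every node.

```python
import random
from itertools import combinations


def closest_pair(seqs, columns):
    """Return the pair of sequence names with the fewest differences
    over the given alignment columns (a stand-in for full tree building)."""
    def n_diff(a, b):
        return sum(seqs[a][c] != seqs[b][c] for c in columns)
    return frozenset(min(combinations(sorted(seqs), 2), key=lambda p: n_diff(*p)))


def bootstrap_support(seqs, n_reps=1000, seed=0):
    """Fraction of bootstrap replicates that recover the closest pair
    found in the original data."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    original = closest_pair(seqs, range(n_sites))
    hits = 0
    for _ in range(n_reps):
        # Resample sites with replacement, keeping the data set the same size.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        if closest_pair(seqs, columns) == original:
            hits += 1
    return original, hits / n_reps


# Hypothetical four-sequence alignment.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACTTAGGTCC",
    "D": "GCTTAGGTCA",
}
cluster, support = bootstrap_support(alignment, n_reps=200)
```

The returned fraction plays the role of the number displayed next to a node: the share of resampled data sets in which the same cluster reappears.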

Another  means  of  testing  the  statistical  confidence  in  a tree  is  to  test  the 
null  hypothesis  that  each  interior  branch  has  length  zero.  From  distance 
methods,  we  often  obtain  estimates  of  all  branch  lengths  in  the  tree,  along 
with  their  standard  errors.  If  we  fail  to  reject  the  null  hypothesis  of  zero 
length  for  an  interior  branch,  then  we  lose  confidence  in  the  nodes  surround- 
ing that  branch. 
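
One simple way to carry out such a test is sketched below. It assumes, as an approximation not stated in the text, that the branch-length estimate is roughly normally distributed, so that the ratio of the estimate to its standard error can be referred to a standard normal distribution. The function name and the numerical values are hypothetical.

```python
import math


def interior_branch_p_value(length, std_error):
    """One-sided p-value for H0: branch length = 0, assuming the estimate
    is approximately normally distributed (an assumption of this sketch)."""
    z = length / std_error
    # Upper-tail probability of a standard normal variate.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical estimates: the first branch is well supported, the second is not.
p_long = interior_branch_p_value(0.030, 0.008)
p_short = interior_branch_p_value(0.004, 0.006)
```

A small p-value rejects the zero-length hypothesis and lends confidence to the nodes at either end of the branch; a large one does not.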

Shared Polymorphism

One  might  intuitively  expect  that  all  the  alleles  of  a species  should  cluster 
together  on  a gene  tree,  implying  that  the  common  ancestor  of  all  the  alleles  is 
an  ancestral  allele  within  the  same  species.  A few  gene  trees  have  been  found 
to  have  the  unexpected  property  that  alleles  in  two  or  more  species  appear  to 
be  interdigitated  on  the  tree.  This  pattern,  known  as  shared  polymorphism  or 
trans-species  polymorphism,  has  been  observed  in  major  histocompatibility 
alleles  in  primates  (Lawlor  et  al.  1988),  in  self-incompatibility  alleles  of  plants 
(Ioerger  et  al.  1991),  and  in  several  genes  in  the  melanogaster  species  subgroup 
of  Drosophila  (Hey  and  Kliman  1993).  Figure  8.27  shows  the  probable  means  by 
which  shared  polymorphism  arises,  namely,  that  the  ancestral  species  was 
polymorphic,  and  two  or  more  alleles  remain  in  the  descendant  species  ever 
since  the  time  of  the  common  ancestor.  Recall  that  the  expected  fixation  time 
for a new neutral mutation, given that it goes to fixation, is 4N generations. This
means  that  neutral  alleles  are  quite  unlikely  to  remain  polymorphic  for  much 
longer  periods.  Consequently,  observation  of  shared  polymorphism  implies 
that  either  strong  selection  is  retaining  the  alleles  in  the  population,  or  that  the 
species  have  diverged  relatively  recently.  In  the  first  two  examples  above,  there 
is good evidence that selection has maintained the polymorphisms, while in
the third example, the Drosophila species are recently enough diverged that
some shared neutral polymorphisms are expected.

Figure 8.27 Trans-species or shared polymorphism may occur if the ancestor
was polymorphic for two or more alleles and if alleles persist to the present in
both species.

Interspecific  Genetics 

Phylogenetic  inference  from  molecular  sequences  is  a descriptive  goal  in  the 
sense  that  the  primary  objective  is  to  obtain  an  accurate  representation  of  the 
ancestral  history  of  the  species.  Population  genetics  can  also  address  the 
genetic  basis  for  species  differences,  particularly  in  the  case  of  species  in 
which  some  hybrids  are  at  least  partially  fertile.  Although  these  studies  do 
not  directly  address  the  genetic  causes  for  species  origination,  they  are  rele- 
vant to  the  genetic  causes  of  barriers  to  interspecific  gene  flow.  Investigation 
of  the  genetic  basis  for  hybrid  infertility  and  inviability  among  species  in  the 
Drosophila  melanogaster  species  subgroup  (comprising  the  species  melan- 
ogaster,  simulans,  sechellia,  and  mauritiana)  is  a very  active  area.  One  focus  in 
this  work  has  been  an  investigation  of  the  genetic  basis  for  Haldane's  rule, 
which  states  that,  in  interspecific  hybrids  in  which  only  one  sex  is  sterile  or 
inviable,  the  sex  likely  to  be  affected  is  the  heterogametic  sex  (Coyne  1985; 
Coyne  et  al.  1991).  Rather  than  one  or  two  genes  of  large  effect,  interspecific 
hybrid  sterility  appears  to  be  caused  by  many  genes  that  also  have  a complex 
pattern  of  interaction,  so  that  some  particular  combinations  are  sterile  and 
other  combinations  are  fertile  (Palopoli  and  Wu  1994).  A powerful  tool  for 
studying  the  genetic  basis  of  hybrid  sterility  has  been  to  introgress  small 
pieces  of  the  genome  from  one  species  into  the  other.  By  doing  this  for  many 
regions  distributed  all  over  the  genome,  one  can  learn  about  the  relative  roles 
of  the  X chromosome  and  autosomes,  the  relative  incidence  of  male  vs.  female 
infertility,  and  so  forth  (True  et  al.  1996).  Other  features  of  interspecific  differ- 
ences are  amenable  to  genetic  analysis  by  either  introgression  methods 
(applied  to  differences  in  cuticular  hydrocarbons  by  Coyne  1996)  or  by  scor- 
ing an  array  of  anonymous  markers  in  many  backcross  individuals  (applied 
to  genital  arch  morphology  by  Liu  et  al.  1996). 

MULTIGENE  FAMILIES 

Genes  increase  in  number  through  duplication.  Several  successive  rounds  of 
duplication  result  in  a family  of  homologous  genes  with  related  functions,  a 
multigene  family,  the  members  of  which  are  often  arrayed  in  tandem  along 
the  chromosome.  Among  genes  that  normally  exist  in  tandemly  arrayed 
multigene  families  are  the  rRNA  genes  and  the  histone  genes.  Analysis  of  the 
sequences  of  members  of  multigene  families  has  led  to  some  interesting  sur- 
prises. Figure  8.28  shows  a scenario  whereby  a gene  underwent  a duplica- 
tion that  ultimately  became  fixed  in  the  population  either  through  drift  or 
selection.  Subsequently,  sufficient  sequence  divergence  occurred  that  the  two 
genes could be distinguished. Later a speciation event produced two different
species sharing this pair of genes. Figure 8.29 shows the gene genealogies
at two time points in the evolution of this gene family. At time 1, the A genes
in species 1 and 2 have a more recent common ancestor than do genes A and
B within species 1. At time 2 the pairs of genes present in the same species
are more similar. This is the pattern that is observed in some multigene
families. The close resemblance of A1 with B1, and of A2 with B2, seems
paradoxical, since both species have the duplication, and Figure 8.28 makes it
appear that genes A1 and A2 in the two species have a more recent common
ancestor than do genes A1 and B1. Genes A1 and B1, as well as A2 and B2, may have
more similar sequences because the genes evolve together, in concert, under
the influence of mechanisms that operate to homogenize their sequences.
This tendency toward homogenization is known as concerted evolution.

Figure 8.28 Multigene families originate by a process of gene duplication.
After the duplication the genes may retain very similar functions (like rRNA
genes), or they may diverge (like globin genes). If the species splits into two
species, then time 1 and time 2 depict the relationship between the genes
shortly after speciation and long after speciation (see Figure 8.29).

Causes  of  Concerted  Evolution 

Figure 8.29 Referring to Figure 8.28, at Time 1, genes A1 and A2 in the two
species are more similar to each other than either is to gene B, and likewise B1
and B2 are closest neighbors. This tree reflects the fact that the common ancestor
of A1 and A2 is more recent than that of A1 and B1. If at Time 2 a tree like the
bottom panel is observed, then sequences of A1 and B1 have become more similar,
possibly by the process of gene conversion. The bottom tree illustrates the
phenomenon known as concerted evolution.

Two important mechanisms of concerted evolution are gene conversion and
unequal crossing-over. Gene conversion is a process in which nucleotide pairing
between two sufficiently similar genes is accompanied by the excision of
all or part of the nucleotide sequence of one gene and its replacement by a
replica of the nucleotide sequence from the other gene. Formally, the result
is that the sequence in one gene "converts" the sequence in the other gene to
be exactly like itself. In unequal crossing-over, meiotic pairing between the
tandem repeats in homologous chromosomes is out of register, and crossing-over
results in an increase in the number of copies in one chromosome and a
corresponding decrease in the number of copies in the other chromosome.
Repeated rounds of unequal crossing-over can result in the disproportionate
representation of certain sequences among members of the multigene family,
a result that is formally equivalent to gene conversion.
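
The bookkeeping of a single unequal crossover can be sketched in Python. The list representation of a tandem array, the function name `unequal_crossover`, and the parameters `offset` and `breakpoint` are illustrative assumptions; the sketch shows only that out-of-register exchange yields one longer and one shorter array.

```python
def unequal_crossover(chrom1, chrom2, offset, breakpoint):
    """Return the two recombinant arrays from an out-of-register crossover.

    chrom1, chrom2: lists of repeat copies in the two homologs.
    offset:     number of repeats by which pairing is out of register.
    breakpoint: number of copies of chrom1 to the left of the exchange;
                must satisfy offset <= breakpoint <= len(chrom1).
    """
    if not offset <= breakpoint <= len(chrom1):
        raise ValueError("breakpoint must lie within the paired region")
    # Copy `breakpoint` of chrom1 is aligned with copy `breakpoint - offset`
    # of chrom2, so the exchange point differs between the homologs.
    longer = chrom1[:breakpoint] + chrom2[breakpoint - offset:]
    shorter = chrom2[:breakpoint - offset] + chrom1[breakpoint:]
    return longer, shorter


homolog1 = ["a1", "a2", "a3", "a4"]
homolog2 = ["b1", "b2", "b3", "b4"]
gain, loss = unequal_crossover(homolog1, homolog2, offset=1, breakpoint=2)
```

With four copies per homolog and pairing displaced by one repeat, the products carry five and three copies, so the total number of copies is conserved while individual sequences gain or lose representation.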

A theoretical model of concerted evolution has been studied by Ohta
(1982). In this model, a tandemly arranged multigene family consists of a
fixed number of n members, and λ is the probability that a particular member
of the gene family becomes converted by another member in any one generation.
(Equivalently, λ is the probability of completion of a cycle of unequal
crossing-over resulting in the replacement of one sequence in the family by
another.) The mutation rate per copy is μ, and the population number is N.

In  a tandemly  arrayed  multigene  family,  there  are  three  distinct  types  of 
identity  by  descent  (IBD)  among  the  gene  copies  (Figure  8.30): 

1. Genes at different positions in the same chromosome may be IBD
(probability c1).

2.  Genes  at  different  positions  in  different  chromosomes  may  be  IBD 
(probability  c2). 

3.  Genes  at  the  same  position  in  different  chromosomes  may  be  IBD 
(probability  f). 

Complex formulas for the equilibrium values of c1, c2, and f have been
derived by Ohta (1982), but they are greatly simplified when recombination
within the gene cluster is ignored. In such a case, the equilibrium values are
approximately

$$c_1 \approx c_2 \approx \frac{\lambda}{\lambda + (n-1)\mu}$$

$$f \approx \frac{4N\lambda c_2 + 1}{4N\lambda + 4N\mu + 1} \qquad (8.18)$$

Figure 8.30 Three types of identity by descent in multigene families. They are
the identity between genes at homologous sites (probability f), between genes at
nonhomologous sites in the same chromosome (probability c1), and between
genes at nonhomologous sites in different chromosomes (probability c2). (After
Ohta 1982.)

In Equations 8.18, the quantity (n - 1)μ is very nearly equal to nμ if n is
reasonably large. Because n is the number of copies of the gene in each tandem
array, nμ is the total rate of mutation in the multigene family, summed
across all copies. Thus, the implication of Equation 8.18 is that there is a
delicate balance between the rate of gene conversion λ and the total mutation
rate nμ. If the rate of gene conversion is much greater than the total mutation
rate, then the probability of IBD of genes at different positions within the
family (c1 and c2) is close to 1.0. On the other hand, if λ is much smaller than
the total mutation rate, then the probability of IBD of genes at different
positions within the family is close to zero.
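
The balance between conversion and mutation can be explored numerically. The sketch below evaluates the approximate equilibrium expressions of Equations 8.18 as read here (the identity for different positions as the conversion rate divided by the sum of the conversion rate and (n - 1) times the mutation rate, and f from the second expression); the function name, the argument names `lam` and `mu`, and all parameter values are hypothetical.

```python
def ibd_equilibria(lam, mu, n, N):
    """Approximate equilibrium identity probabilities from Equations 8.18,
    ignoring recombination within the gene cluster.

    lam: conversion rate per gene copy per generation
    mu:  mutation rate per copy
    n:   number of copies in the tandem array
    N:   population number
    """
    c = lam / (lam + (n - 1) * mu)  # c1 and c2 (different positions)
    f = (4 * N * lam * c + 1) / (4 * N * lam + 4 * N * mu + 1)  # same position
    return c, f


# Conversion much faster than total mutation: copies are nearly all IBD.
c_high, f_high = ibd_equilibria(lam=1e-4, mu=1e-8, n=100, N=10_000)
# Conversion much slower than total mutation: identity among copies is rare.
c_low, f_low = ibd_equilibria(lam=1e-8, mu=1e-6, n=100, N=10_000)
```

Varying `lam` relative to `n * mu` moves the identity between copies from near zero to near one, which is the balance the paragraph describes.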

Concerted  evolution  does  not  homogenize  all  multigene  families. 
Depending  on  the  balance  of  the  forces  of  mutation,  gene  conversion,  and 
unequal  crossing-over,  the  pair  of  genes  may  remain  active  and  very  similar, 
or  they  may  diverge  in  function  (such  as  different  tissue-specific  forms  of 
amylase  or  lactate  dehydrogenase),  or  one  gene  may  lose  function  and 
become  a pseudogene.  Multigene  families  can  avoid  the  accumulation  of 
mutations  when  there  is  sufficiently  strong  natural  selection,  and  positive 
selection  is  necessary  for  genes  to  evolve  new  functions.  Walsh  (1988) 
addressed  the  question  of  genes  within  a family  escaping  from  gene  conver- 
sion, and  he  showed  that  higher  mutation  rates  and  lower  conversion  rates 
lead  to  greater  likelihood  for  a gene  escaping  conversion.  Once  a gene  is  suf- 
ficiently divergent  to  have  escaped  conversion,  it  can  either  lose  function 
and  become  a pseudogene  or  it  can  acquire  a new  function.  Simple  models  of 
such  a duplicated  gene  show  that  very  little  selection  is  needed  in  a large 
population  to  avoid  a pseudogene  fate  (Walsh  1995). 

Multigene  Family  Evolution  through  a Birth  and  Death  Process 

Duplicate  genes  can  evolve  in  separate  ways  under  the  influence  of  natural 
selection,  mutation,  and  random  genetic  drift.  In  time,  some  members  of  a 
multigene  family  may  diverge  to  a greater  or  lesser  degree  in  their  function. 
This  process  of  duplication  and  divergence  is  thought  to  be  the  major  mech- 
anism by  which  genes  with  novel  functions  are  created.  Some  multigene 
families  retain  a tandemly  arrayed  structure  and  similarity  in  function  across 
members despite the fact that the differences between individual members
are of functional significance. This pattern is particularly true of genes in the
immune  system,  including  immunoglobulin  genes  and  major  histocompati- 
bility genes.  Interspecific  comparisons  of  genes  in  families  of  this  sort  exhibit 
some  genes  that  are  clearly  homologous,  and  others  that  are  more  distantly 
related.  In  addition,  the  rate  of  duplication,  loss  of  function  through  pseudo- 
genes, and  loss  by  deletion,  may  be  fairly  high.  This  kind  of  pattern  of  multi- 
gene family  evolution  is  different  from  concerted  evolution,  because  the  dif- 
ferences between  the  genes  can  be  high  enough  that  intergenic  conversion  is 
very  rare.  Figure  8.31  illustrates  the  distinctness  of  this  pattern  of  gene  evo- 
lution, called  a birth-and-death  process  by  Ota  and  Nei  (1994). 

Figure 8.31 In addition to concerted evolution and simple divergent evolution,
multigene families frequently exhibit the phenomenon of genes being
added and lost to families by a "birth and death process." (From Ota and Nei
1994.)

Figure 8.32 illustrates the result of duplication and divergence in two
related multigene families in mammals that code for the α-like and β-like
polypeptide chains of hemoglobin. The genes are specialized for different
periods of life. The ε (epsilon) genes are expressed in embryos; the Gγ and Aγ
genes and the α genes in the fetus; and the α, β, and δ genes in the adult. The
inference from differences in nucleotide sequence is that the original
α-β duplication took place approximately 500 million years ago, when
vertebrates were represented by the bony fishes, and the β-γ duplication took
place about 80 million years ago, during the mammalian radiation. More
recent duplications have also occurred, for example those leading to the two
functional α genes, the cluster of three α-like pseudogenes, and the two γ
genes. There are several models for the sequence of duplication, deletion,
and conversion events that could have led to the current array of globin
genes (Goodman et al. 1984; Hardies et al. 1984; Hardison 1984; Margot et al.
1988), but it appears well substantiated that the ancestral cluster that
predated the mammalian radiation was 5'-ε-γ-η-δ-β-3'. Within the mammalian
radiation, the different orders of mammals evolved along different routes. In
prosimian primates, such as lemurs, there was a fusion of η and δ. In higher
primates, including humans, there was a δ-β conversion and a γ duplication.
In rodents, β and γ both duplicated, η was deleted, and there was a δ-β
fusion, mediated probably by an unequal crossover. In rabbits, η was deleted
and there was a δ-β conversion. Finally, in goats, γ was deleted, there
was a δ-β conversion, and the remaining four-gene array was then triplicated!

The evolutionary history of the fetal globin genes in humans reveals that
the Gγ and Aγ genes originated as part of a relatively recent 5 kb tandem
duplication (Shen et al. 1981). Furthermore, evidence from nucleotide
sequences strongly suggests that a gene conversion event also occurred,
which converted part of one particular Aγ allele into a Gγ allele (Slightom et
al. 1980). The converted Aγ allele is very similar to a Gγ allele for about 1550
bp on the upstream (5') side of a putative recognition signal for gene
conversion (a stretch of repeating TG and CG dinucleotides); but on the
downstream (3') side of the putative signal, the converted Aγ allele is typical of
other Aγ alleles in the human population. The Aγ to Gγ gene conversion
occurred much more recently than the duplication, resulting in the close
sequence similarity of the Aγ and Gγ genes.

Figure 8.32 Reconstruction of the β-globin sequences in a series of mammals
illustrates the complexity of duplication, loss, and gene conversion in this
multigene family. (After Hardison 1984.)

The estimate of the time of occurrence of the Aγ-Gγ duplication can be
improved by using the nucleotide sequence data from the entire duplicated
5 kb region. In the entire region, 14% of the nucleotide sites differ, which
translates into k = 0.155 ± 0.006; this suggests a time for the duplication of
0.155 × 100 × 2.2 × 10^6 = 34 million years (Shen et al. 1981).
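
The arithmetic can be checked with a short Python sketch. It assumes that the step from 14% observed differences to k = 0.155 uses the Jukes-Cantor correction, which reproduces the quoted value but is not named in this passage; the conversion factor of 2.2 million years per 1% divergence is taken directly from the calculation above, and the function name is hypothetical.

```python
import math


def jukes_cantor_k(p):
    """Estimated substitutions per site from the proportion p of differing
    sites, using the Jukes-Cantor correction k = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


k = jukes_cantor_k(0.14)          # about 0.155
# Conversion used in the text: 2.2 million years per 1% divergence in k.
years = k * 100 * 2.2e6           # about 34 million years
```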

Unequal crossing-over in multigene families can result in a decrease in
the number of genes as well as an increase. It is therefore not surprising that
deletions of one or more of the hemoglobin genes are found in most parts of
the world. Although usually very rare, in a few places the frequency of
the deletions reaches levels too great to be accounted for by chance,
especially in view of the observation that the carriers are mildly to severely
anemic. Although a deletion of the β gene results in death when homozygous,
a β deletion and other mutations that decrease the abundance of the
β-hemoglobin chain are relatively common in the Mediterranean Sea basin
where malaria is endemic. For this reason, the decreased-β-chain diseases
are called β-thalassemias (literally translated as "sea-anemias"). The
well-established link between sickle-cell anemia and malaria, along with the
geographical correlation between the β-thalassemias and malaria, provides a
strong circumstantial case for malarial parasites being an important selective
agent. Deletion of one or more of the α-globin genes results in another form
of anemia called α-thalassemia, whose frequency in populations is also
correlated with the incidence of malaria.

Red-green  colorblindness  is  a common  X-linked  disorder  with  a frequen- 
cy of  about  8%  in  Caucasian  males.  The  genes  for  the  red  and  green  visual 
pigments  match  at  98%  of  their  nucleotides,  indicating  that  they  arose  by  a 
relatively recent duplication. Individuals with normal color vision have one
copy  of  the  red  pigment  gene  and  varying  numbers  of  copies  of  the  green 
pigment  gene.  When  genomic  DNA  from  colorblind  males  was  analyzed  by 
Southern  blotting,  those  defective  in  green  vision  were  lacking  fragments  of 
the  green  pigment  gene.  Further  analysis  showed  that  24  of  25  colorblind 
individuals  had  lost  one  or  the  other  pigment  gene  through  gene  rearrange- 
ments that  were  due  either  to  unequal  crossing-over  or  gene  conversion.  In 
this  example,  the  high  sequence  similarity  of  the  red  and  green  pigments 
works  to  human  disadvantage  by  greatly  increasing  the  likelihood  of 
exchange  events  that  lead  to  loss  of  color  vision  (Nathans  et  al.  1986).  The 
relationship  between  the  molecular  basis  of  light  absorption  and  perception 
was  made  particularly  clear  when  it  was  found  that  a normal  polymorphism 
in  red  pigments,  which  confers  a difference  in  the  absorption  peak  of  the  pro- 
tein product,  also  confers  a measurable  difference  in  the  perception  of  color 
balance  (Merbs  and  Nathans  1992). 

Duplication of genes also occurs in plants, including a particularly important
gene in plants that encodes the carbon-fixing enzyme ribulose-1,5-bisphosphate
carboxylase (RBC) (Clegg et al. 1997). The functional RBC
holoenzyme  consists  of  eight  large  and  eight  small  subunits.  Early  in  plant 
evolution,  both  the  large  and  small  subunits  of  RBC  were  encoded  by  the 
chloroplast  genome,  but  the  small  subunit  gene  was  transferred  to  the 
nuclear  genome  at  an  early  stage  and  has  now  been  lost  from  the  chloroplast 
genome. Diploid angiosperms contain from two to eight copies of the gene
for the small RBC subunit (rbcS). All copies of rbcS appear to be functionally
equivalent,  and  sequence  analysis  shows  that  the  genes  that  are  closest 
together  in  the  genome  are  also  generally  more  similar  in  sequence.  In 
sequence  comparisons  among  rbcS  genes  of  tobacco  and  tomato,  homologous 
genes  compared  between  the  two  species  are  more  similar  than  within 
species  comparisons  of  gene  copies.  This  finding  is  not  the  pattern  expected 
under  concerted  evolution.  The  variable  number  of  loci  across  angiosperms 
suggests  that  gain  and  loss  of  gene  copies  occurs  to  give  a pattern  like  the 
birth-and-death  process  described  above. 

Structural  RNA  Genes  and  Compensatory  Substitutions 

Transfer  RNA  and  ribosomal  RNA  molecules  derive  their  biochemical  prop- 
erties from  the  secondary  structure  into  which  they  fold.  We  are  still  learning 
the  chemical  rules  by  which  such  macromolecules  attain  their  final  folded 
configuration,  but  one  thing  that  is  very  clear  is  that  complementary  base 
pairing  is  important.  The  stems  of  tRNAs  are  critical  to  maintaining  the  tight- 
ly folded  structure  of  these  essential  molecules.  Substitutions  that  occur  in 
stems  will  weaken  the  stability  of  the  stem  unless  there  is  a compensatory 
change  on  the  other  strand  that  maintains  base  pairing.  Kimura  (1985)  real- 
ized that  one  could  obtain  evidence  for  such  compensatory  changes.  More 
recently, such compensatory changes have been found in an intron,
suggesting that the folding structure of introns may also be important to
regulating gene expression (Kirby et al. 1995).
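
The logic of a compensatory substitution can be illustrated with a toy stem. In the sketch below, the short sequences are invented for illustration, and the inclusion of the G-U wobble pair alongside the Watson-Crick pairs is an assumption; the function simply counts how many facing positions can pair.

```python
# Watson-Crick pairs plus the G-U wobble pair (including G-U is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(strand5, strand3):
    """Count paired positions in a stem; strand3 is read 3'->5' so that
    position i of one strand faces position i of the other."""
    return sum((a, b) in PAIRS for a, b in zip(strand5, strand3))


ancestral = stem_pairs("GCAUC", "CGUAG")      # all five positions pair
single = stem_pairs("GCCUC", "CGUAG")         # A->C breaks one pair
compensated = stem_pairs("GCCUC", "CGGAG")    # U->G on the other strand restores it
```

A single substitution lowers the count of paired positions, and a matching change on the opposite strand brings it back, which is the signature one looks for in aligned stem sequences.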

Further  evidence  of  the  importance  of  secondary  structure  of  rRNA 
comes  from  analysis  of  rRNA  pseudogenes  in  plants  (Buckler  et  al.  1997). 
One  attribute  of  secondary  structure  is  measured  as  the  difference  in  free 
energy  attributable  to  complementary  base  pairing  in  the  folded  vs.  unfold- 
ed state.  Computer  predictions  of  the  best  folding  structure  of  the  rRNA 
pseudogenes  suggested  that  the  difference  in  free  energy  decreases  as  the 
sequences  accumulate  substitutions.  Tests  of  randomly  permuted  sequences 
showed  that  the  functional  rRNA  sequences  are  significantly  more  stable 
than  would  be  obtained  by  chance,  whereas  predicted  pseudogene  RNAs  are 
not.  Some  introns  have  a significantly  open  secondary  structure,  such  that 
random  substitutions  in  their  sequences  result  in  more  stable  structures 
(Leicht  et  al.  1995).  The  reason  some  introns  retain  an  open  structure  may  be 
for  access  to  regulatory  proteins.  This  possibility  has  been  indirectly  demon- 
strated by  showing  that  stable  stems  inserted  into  introns  in  yeast  can  disrupt 
normal  splicing. 

The  ribosomal  RNA  gene  cluster  in  Drosophila  melanogaster  consists  of 
about  200  copies  of  a repeated  unit  on  both  the  X and  the  Y chromosome, 
with  each  repeated  unit  containing  an  18S  and  a 28S  rRNA  gene  separated  by 
an  intergenic  sequence  (IGS)  (Glover  and  Hogness  1977).  The  rRNA  genes 
provide a clear example of concerted evolution because of great interspecific
differences in spite of a high degree of sequence conservation within species
(Coen  et  al.  1982).  Furthermore,  within  individuals  of  D.  mercatorum,  there 
appears  to  be  little  sequence  variation,  yet  there  are  clear  differences  between 
individuals  due  to  length  variation  in  the  intergenic  sequence  (Williams  et  al. 
1985).  This  finding  suggests  the  operation  of  a strong  homogenizing  force 
maintaining  sequence  fidelity  within  individuals.  In  humans,  the  rDNA 
repeat  consists  of  a 13  kb  transcribed  portion  and  a 31  kb  spacer  (Wellauer 
and  Dawid  1979).  This  repeated  unit  is  present  in  about  300  copies  located 
near  the  tips  of  the  short  arms  of  five  nonhomologous  chromosomes.  Despite 
the  dispersed  locations,  concerted  evolution  still  occurs  as  evidenced  by 
much  less  variation  among  sequences  within  an  individual  than  among 
species.  Interchromosomal  exchange  events  would  lead  to  conservation  of 
sequence  distal  to  the  rDNA  cluster  on  each  chromosome,  and  evidence  for 
this  conservation  has  been  found  (Worton  et  al.  1988). 

Multigene  Superfamilies 

In  some  cases,  several  sets  of  multigene  families  and  single-copy  genes  may 
share  recognizable  homology,  implying  a common  ancestry,  but  they  have 
undergone  major  divergence  in  function  and  relocation  of  position  within 
the  genome.  These  sets  of  historically  related  but  functionally  distinct  genes 
constitute  a multigene  superfamily. 

The remarkable similarities found among portions of genes in related
gene families have suggested that many proteins have functional modules that
can  be  combined  in  various  ways  in  what  is  called  exon  shuffling.  One 
example  of  shuffling  is  found  in  tissue  plasminogen  activator  (TPA),  which 
has  portions  of  three  other  proteins,  including  plasminogen,  epidermal 
growth  factor,  and  fibronectin.  The  striking  finding  is  that  the  junctions  of 
these  protein  segments  fall  precisely  at  intron-exon  junctions.  The  epidermal 
growth  factor  shares  exon  similarity  with  several  other  proteins,  including 
blood  clotting  factors  IX  and  X,  urokinase,  and  complement  C9  (Doolittle 
1985).  The  gene  for  the  low-density  lipoprotein  (LDL)  receptor  in  human 
beings  extends  over  45  kilobases  and  contains  18  exons  that  show  similarity 
to  a bewildering  variety  of  other  proteins,  including  epidermal  growth  factor 
and blood clotting factors (Südhof et al. 1985). Just as a computer programmer
recognizes the value of reusing subroutine modules in different programs,
nature has capitalized on the efficiency of modular gene organization.

One  extensively  studied  multigene  superfamily  that  serves  diverse  func- 
tions in  immunity  is  illustrated  in  Figure  8.33  (Hood  1985;  Hunkapiller  and 
Hood  1986).  The  primordial  single-copy  gene  may  have  coded  for  a cell-sur- 
face receptor  containing  the  basic  homology  unit  of  the  superfamily,  which  is 
about  110  amino  acids  in  length  with  a strategically  placed  disulfide  bridge 
and  folding  characteristics  enabling  it  to  combine  with  other  similar  units. 
An early duplication and divergence of the primordial gene resulted in the


Figure 8.33 Proposed evolution of the immunoglobulin multigene superfamily
from a primordial gene coding for a cell-surface receptor. Details of the
evolutionary relationships are speculative. The superfamily has diversified into 12
single-gene representatives (all of those at the left, plus β2-microglobulin,
labeled β2-m, at the right), and eight multigene families (remaining
representatives at the right). These include genes for antibodies, T-cell receptors,
major histocompatibility antigens, and other functions. The single-gene members
include T-cell molecules implicated in MHC recognition (CD4 and CD8) and
possibly ion channel formation (T3δ, T3ε), an immunoglobulin-transport protein
(poly-Ig), a plasma protein (α1B-glycoprotein), two molecules restricted to
lymphocytes and neurons (Thy-1 and OX-2), two brain-specific proteins (N-CAM and
NCP3), and β2-microglobulin. The multigene families include the heavy (H) and
light (κ, λ) components of antibody molecules, the α, β, and γ chains of T-cell
receptors, and the Class I and Class II molecules from the major
histocompatibility complex (HLA). (Adapted from Hood et al. 1985 and Hunkapiller and
Hood 1986.)

variable (V) and constant (C) domains that have been so versatile in their
diversification  for  specialized  immune  functions.  In  some  members  of  the 
immunoglobulin  superfamily,  shown  at  the  left  in  Figure  8.33,  the  functional 
products  are  usually  individual  polypeptide  chains,  sometimes  containing 
internal  duplications  of  the  primordial  folding  unit.  These  products  include 
the  poly-Ig  receptor  that  mediates  the  transport  of  immunoglobulin  mole- 
cules across  cell  membranes. 

In  the  other  main  branch  of  the  superfamily,  shown  at  the  right,  the  func- 
tional products  are  usually  aggregates  of  polypeptide  chains.  In  this  branch, 
there  occurred  multiple  duplications  of  the  V regions  and  specialization  of  D 
(diversity) and J (joining) regions during the evolution of the DNA splicing
mechanism  in  lymphocytes,  which  today  results  in  the  tremendous  diversity 
of  antibodies  and  T-cell  receptors.  During  the  formation  of  heavy-chain  anti- 
body genes  in  the  lymphocytes,  any  one  of  a large  number  of  DNA 
sequences  coding  for  the  variable  part  of  the  molecule  can  become  spliced 
with  any  one  of  a small  number  of  DNA  sequences  coding  for  the  constant 
part, with diversity and joining regions incorporated in between. The many
possible V-D-J-C combinations enable enormous numbers of different
antibodies to be formed, and this number is increased still further by slight
variation in the exact positions of the splice junctions. An analogous type of
splicing  process  occurs  in  the  formation  of  antibody  light-chain  genes  and  T- 
cell  receptor  genes. 
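
The arithmetic behind this combinatorial diversity can be sketched in a few lines of Python. The segment counts below are hypothetical placeholders chosen only for illustration; they are not taken from the text or from any particular species, and the function name is likewise an assumption.

```python
from math import prod

def combinatorial_diversity(segment_counts):
    """Number of distinct joined genes when one segment is chosen
    independently from each class (V, D, J, C, ...)."""
    return prod(segment_counts.values())

# Hypothetical numbers of germline segments, for illustration only.
heavy_chain = {"V": 100, "D": 10, "J": 5, "C": 5}

n_heavy = combinatorial_diversity(heavy_chain)
print(n_heavy)  # 100 * 10 * 5 * 5 = 25000
```

Under these assumed counts, 120 germline segments yield 25,000 combinations, before any additional variation at the splice junctions is taken into account.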

In  yet  another  offshoot  of  the  immunoglobulin  superfamily,  shown  at  far 
right  in  Figure  8.33,  the  C region  underwent  duplication  and  specialization  to 
form  molecules  of  the  major  histocompatibility  complex  (MHC),  which, 
among  other  functions,  are  necessary  for  the  T cells  of  the  immune  system  to 
recognize  foreign  antigens.  Complete  sequencing  of  a 100  kb  region  of  the  T- 
cell  receptor  gene  family  has  revealed  a spectacular  degree  of  sequence  conser- 
vation between  human  and  mouse  (Koop  and  Hood  1994).  The  opportunities 
for  exceptionally  detailed  analysis  of  multigene  family  evolution  have  enlarged 
with  genomic  sequencing  methods  already  producing  the  complete  sequence 
of  entire  arrays  of  genes  (Rowen  et  al.  1996).  Although  many  aspects  of  the 
immunoglobulin  superfamily  tree  in  Figure  8.33  are  speculative,  the  molecules 
are  undoubtedly  related  because  comparison  of  the  relevant  units  gives  15  to 
40%  homology  at  the  amino  acid  level,  and  at  the  DNA  level  each  homology 
unit  is  encoded  in  a separate  exon.  The  immunoglobulins  thus  demonstrate  the 
immense  evolutionary  potential  of  repeated  rounds  of  duplication  and  diver- 
gence through  specialization  of  function. 

Dispersed  Highly  Repetitive  DNA  Sequences 

A second  major  class  of  highly  repetitive  DNA  in  eukaryotes  is  not  localized 
in clusters of tandemly repeating units, but is dispersed throughout the
genome with single-copy sequences. The importance of dispersed repetitive
elements  to  the  human  genome  project  is  made  clear  by  the  realization  that 
they  constitute  35%  of  our  genome  (Smit  1996).  In  vertebrates,  this  dispersed 
highly  repetitive  DNA  occurs  primarily  in  two  categories,  denoted  SINEs 
and LINEs (Singer 1982). SINEs (short interspersed elements) are sequences
typically shorter than 500 base pairs that occur in 10^5 or more copies in the
genome. Like tRNA genes, they contain internal transcriptional start sites
and are transcribed by RNA polymerase III. LINEs (long interspersed elements)
are sequences typically greater than 5000 base pairs that occur in 10^4
or more copies in the genome. They are processed pseudogenes (see below)
and, when transcribed, are transcribed by RNA polymerase II. Marked differences
in the particular array of subfamilies of SINEs or LINEs, or both,
are frequently observed among even closely related species (Figure 8.34). The
mechanisms  and  possible  significance  of  such  massive  and  rapid  changes  in 
repetitive  DNA  in  the  genome  are  very  obscure. 

One  example  of  SINEs  in  human  DNA  is  the  Alu  family,  named  because 
the  sequence  contains  a characteristic  restriction  site  for  the  restriction 
enzyme AluI. The Alu sequence is about 300 nucleotides in length. Alu
sequences  are  present  in  approximately  one  million  copies  in  the  human 
genome  and  constitute  approximately  ten  percent  of  the  total  DNA  (Smit 
1996).  Sequences  closely  related  to  Alu  are  found  in  other  primates,  and  more 
distantly  related  sequences  occur  in  rodents  and  probably  in  all  placental 
mammals.  Two  randomly  chosen  human  Alu  sequences  differ,  on  the  aver- 
age, at  15  to  20%  of  their  nucleotide  sites,  which  calculates  to  a time  of  diver- 
gence of  between  16.7  and  23.3  million  years.  In  the  human  genome  there  is 
an Alu element an average of every 3 to 5 kb, but the distribution is not uniform.
For example, the β-tubulin and thymidine kinase gene regions have
about  10  times  the  average  density  of  Alu  repeats  (Slagel  et  al.  1987),  and  Alu 
repeats  show  a preference  for  integrating  into  oligo-dA  runs  (Daniels  and 
Deininger  1985). 
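
The text does not say how the divergence times for Alu sequences were calculated. One reading that reproduces both figures is to correct the observed differences with the Jukes-Cantor formula and then divide by an assumed rate of 5 × 10^-9 substitutions per site per year along each of the two lineages. The rate and the function names in the sketch below are assumptions made for illustration.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from the observed
    proportion p of differing sites (defined only for p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

# Assumed rate: substitutions per site per year along ONE lineage.
# Two sequences diverge from each other at twice this rate.
RATE_PER_LINEAGE = 5e-9

def divergence_time_years(p, rate=RATE_PER_LINEAGE):
    """Time since two sequences shared an ancestor, in years."""
    return jukes_cantor(p) / (2 * rate)

for p in (0.15, 0.20):
    d = jukes_cantor(p)
    t = divergence_time_years(p)
    print(f"p = {p:.2f}  d = {d:.3f}  t = {t / 1e6:.1f} million years")
```

With these assumptions, 15% and 20% observed differences correspond to about 0.167 and 0.233 substitutions per site, or 16.7 and 23.3 million years.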


PROBLEM 8.11 The third chromosome of Drosophila pseudoobscura
is  polymorphic  for  more  than  a dozen  inversions  that  result  in  differ- 
ent gene  orders.  Polymorphisms  of  this  sort  are  different  from  nucle- 
otide site  substitutions  because  they  retain  some  information  about 
the  order  of  events.  Consider,  for  example,  the  sequences  A-B-C-D-E 
and  C-E-A-D-B.  Can  you  deduce  the  order  of  the  events  that  connect 
them?

Figure 8.34 A dot plot comparison of the human and rabbit sequences spanning
the δ- and β-globins. Each dot represents a small bit of sequence similarity,
much of the background being due solely to chance, and the regions of extended
similarity stand out as diagonal line segments. The scales are in kilobases, and
the rectangles indicate the location and organization of the globin genes. The
solid arrows show the location of a rabbit L1 repeat, and open triangles indicate
human Alu sequences and rabbit OcC repeats (a rabbit SINE). The major diagonal
line indicates that there is noticeable homology retained through the δ-β
intergenic region, and the sequence similarity of human β-globin with rabbit
δ-globin (and vice versa) is evident. (From Margot et al. 1988.)

ANSWER From A-B-C-D-E, the first inversion must have been the
segment  A-B-C,  giving  the  sequence  C-B-A-D-E.  Next,  the  segment 
A-D  inverted  to  give  C-B-D-A-E.  Finally,  the  segment  B-D-A-E  invert- 
ed to  give  C-E-A-D-B.  Much  more  elaborate  problems  of  inference 
have  arisen  to  determine  the  ancestral  series  of  inversions  and  num- 
ber of  events  needed  to  go  from  one  gene  order  to  another.  Computer 
scientists  refer  to  this  problem  as  "sorting  by  reversals."  You  can  see 
that  given  any  random  ordering  of  integers,  a finite  number  of  inver- 
sions or  reversals  will  put  them  into  the  correct  order.  Motivated  by 
the  biological  problem,  an  algorithm  for  finding  the  minimum  num- 
ber of  reversals  to  go  from  one  order  to  another  was  recently  imple- 
mented (Bafna  and  Pevzner  1996).  As  more  genomes  are  fully 
mapped  and  sequenced,  this  is  likely  to  be  an  area  of  considerable 
excitement. Ehrlich et al. (1997) recently estimated that about 180
rearrangements are required to connect the human and mouse
genetic maps.
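
The three inversions in the answer can be checked mechanically, and for gene orders this short a brute-force search can confirm how many inversions are needed. The sketch below is a plain breadth-first search written for illustration; it is not the algorithm of Bafna and Pevzner, it becomes impractical beyond a handful of genes, and the function names are assumptions.

```python
from collections import deque

def reverse_segment(order, i, j):
    """Return a copy of `order` with positions i..j (inclusive) inverted."""
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]

def min_reversals(start, target):
    """Breadth-first search for a shortest series of inversions that
    turns `start` into `target`. Returns the list of gene orders visited."""
    start, target = tuple(start), tuple(target)
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            path = []
            while current is not None:
                path.append("-".join(current))
                current = parent[current]
            return path[::-1]
        n = len(current)
        for i in range(n):
            for j in range(i + 1, n):
                nxt = reverse_segment(current, i, j)
                if nxt not in parent:
                    parent[nxt] = current
                    queue.append(nxt)
    return None

# The three inversions described in the answer.
step1 = reverse_segment(tuple("ABCDE"), 0, 2)  # invert A-B-C
step2 = reverse_segment(step1, 2, 3)           # invert A-D
step3 = reverse_segment(step2, 1, 4)           # invert B-D-A-E
print("-".join(step3))                         # C-E-A-D-B

path = min_reversals("ABCDE", "CEADB")
print(len(path) - 1, path)
```

The search reports a minimum of three inversions, although the particular shortest path it finds need not be the one given in the answer, because more than one series of three inversions may connect the two orders.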


An example of LINEs in the human genome is the L1 family of sequences
(also called LINE-1 or Kpn, because of a characteristic restriction site). The L1
sequences average about 2,000 nucleotides, and the 50,000 copies of the
sequence in the human genome account for about 4% of the total DNA. As
with the Alu family, sequences related to L1 are found in other mammals,
including the mouse (Hardies et al. 1986) and the rabbit (Demers et al. 1986).
Not all insertions of L1 sequences are innocuous. Kazazian et al. (1988) found
two cases of hemophilia A that were caused by de novo insertions of an L1
sequence into exon 14 of the factor VIII gene, whose function is necessary
for normal blood coagulation. This insertional mutation event was evidently
mediated by an RNA intermediate and provides a mechanism for natural
selection to operate on L1 elements. Another deleterious mutation caused by
a transposable element in humans was an insertion of an L1 sequence into
the myc oncogene in a human breast cancer (Morse et al. 1988).

In  their  molecular  organization,  LINE  sequences  strongly  resemble  a class 
of  pseudogenes  known  as  processed  pseudogenes.  Processed  pseudogenes  are 
thought  to  result  from  the  reverse  transcription  of  an  RNA  molecule  into 
DNA,  followed  by  insertion  of  the  DNA  into  the  genome.  The  reverse  tran- 
scription and  integration  process  can  be  carried  out  by  an  enzyme  called 
reverse  transcriptase,  which  is  coded  in  the  genome  of  a class  of  RNA- 
containing viruses called retroviruses. In cells infected with retrovirus, the
reverse transcriptase makes a DNA copy of the viral RNA, and another
enzyme  inserts  the  DNA  into  the  chromosome.  When  reverse  transcription 
and  integration  happen  to  a processed  RNA  molecule,  the  result  is  a dispersed 
duplicate  copy  that  is  generally  transcriptionally  inactive  due  to  loss  of  regu- 
latory sequences.  Such  a sequence  is  known  as  a processed  pseudogene. 
Many genes are known to have processed pseudogene counterparts, including
the genes for human κ-immunoglobulin and β-tubulin, rat α-tubulin and
cytochrome c, and mouse α-globin. Not all genes that have been processed
through an RNA intermediate are pseudogenes. Human phosphoglycerate
kinase (PGK) occurs as an active X-linked gene, a processed X-linked pseudogene,
and an autosomal gene with remarkable properties. The normal PGK-1
gene contains 11 exons and 10 introns, but the autosomal gene has no introns
and has remnants of a poly-A tail, strongly implying that it was reverse
transcribed from an RNA transcript. The intron-free autosomal gene (PGK-2) is
expressed in human testes (McCarrey and Thomas 1987).

The  processed  pseudogene  model  of  dispersed  repeated  DNA  evolution 
is  illustrated  in  Figure  8.35  (Hardies  et  al.  1986).  The  functional,  transcribed 
copies  of  the  gene  family  are  shown  at  the  top,  and  the  horizontal  arrows  rep- 
resent gene  conversion,  which  promotes  concerted  evolution  of  the  function- 
al genes.  The  gene  in  the  center  is  a preferred  donor  for  gene  conversion 
(biased gene conversion). Emanating from the functional genes are numerous



Figure  8.35  Model  for  the  evolution  of  a dispersed  highly  repetitive  family  of 
processed  pseudogenes.  A small  number  of  functional  genes  (top),  which 
undergo  concerted  evolution  by  means  of  gene  conversion,  are  transcribed 
under  conditions  that  favor  reverse  transcription  and  integration  into  numerous 
dispersed  chromosomal  locations.  The  resulting  nonfunctional  genes  undergo 
mutation  and  random  genetic  drift,  and  are  ultimately  eliminated  by  deletion  or 
other mechanisms. (From Hardies et al. 1986.)

copies of processed pseudogenes distributed throughout the genome. These
copies  are  essentially  functionless  and  undergo  sequence  divergence  pro- 
moted by  mutation  and  random  genetic  drift,  which  is  offset  in  part  by  gene 
conversion  and  other  homogenizing  processes  among  the  pseudogenes. 
Eventually  the  pseudogene  sequences  are  cleared  from  the  genome  by  dele- 
tion or  extreme  sequence  rearrangement  or  divergence. 

One  implication  of  the  model  in  Figure  8.35  is  that,  eventually,  a balance 
is  reached  in  which  the  clearance  of  old  pseudogenes  from  the  genome  is 
equaled  by  the  creation  and  insertion  of  new  ones.  In  the  equilibrium  state 
there  is  a steady  turnover  among  sequences  in  the  family,  but  the  total  num- 
ber neither  grows  nor  shrinks.  Studies  of  a dispersed  repeated  sequence  in 
the mouse related to human L1 suggest a turnover with a half-life of
approximately two million years. That is, after two million years, half the members
of the gene family will have been removed and replaced with new ones.
However, the L1 family may evolve more rapidly than is typical.
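
To make the turnover concrete, the sketch below converts a half-life into a loss rate and computes the fraction of an initial set of elements still present after a given time. The text quotes only the half-life, so the exponential form exp(-ht), with a constant loss rate h, is an assumption made here. The function names are illustrative.

```python
import math

def decay_rate(half_life):
    """Loss rate h per unit time, assuming exponential decay exp(-h t)."""
    return math.log(2) / half_life

def fraction_remaining(t, half_life):
    """Fraction of the original family members still present after time t."""
    return math.exp(-decay_rate(half_life) * t)

HALF_LIFE_MY = 2.0  # million years, the mouse estimate quoted above

for t in (2, 4, 10):
    print(t, fraction_remaining(t, HALF_LIFE_MY))
```

Under this model one half of the original elements remain after 2 million years, one quarter after 4 million years, and about 3% after 10 million years, while new insertions keep the total number constant at equilibrium.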

The  very  abundance  of  pseudogenes  implies  that  many  unrelated  genes 
may  have  pseudogenes  in  the  same  vicinity,  as  is  the  case  with  Alu  sequences 
interspersed in the β-globin cluster. Some fraction of these linked pseudogenes
may alter the level, timing, or tissue distribution of transcription of the
genes  to  which  they  are  linked,  or  they  may  have  subtle  effects  on  chromatin 
structure  that  affect  gene  expression.  Through  any  of  a diversity  of  mecha- 
nisms, pseudogene  copies  of  dispersed  highly  repeated  gene  families  could, 
in  principle,  have  effects  on  phenotype  and  thus  be  subject  to  the  influence  of 
natural  selection.  While  true  in  principle,  such  effects  have  not  yet  been 
demonstrated.  To  the  extent  that  such  effects  can  safely  be  ignored,  the  evo- 
lutionary mechanism  of  highly  dispersed  repeated  DNA  sequences  is  that  of 
selfish  DNA,  subject  to  the  conflicting  forces  of  neutral  mutation/random 
drift  and  the  diverse  homogenizing  processes  of  concerted  evolution. 


SUMMARY 

The  discipline  of  molecular  population  genetics  has  as  its  theoretical  foun- 
dation the  neutral  theory,  which  provides  a rich  set  of  testable  hypotheses 
about  the  mechanisms  that  modify  patterns  of  sequence  divergence  and 
sequence  polymorphism.  We  saw  that  underlying  models  must  be  specified 
even  to  do  seemingly  straightforward  things  like  estimating  rates  of  substi- 
tution. The  reason  substitution  rate  estimates  are  not  trivial  is  that,  with 
greater  divergence,  subsequent  mutations  may  not  further  increase  the 
divergence  if  the  site  has  already  been  substituted.  From  observed  counts  of 
amino  acid  or  nucleotide  differences,  we  usually  want  to  estimate  numbers 
of changes per site. The model for amino acid substitution is not very difficult
because there are 20 amino acids, but even the simplest nucleotide substitution
model of Jukes and Cantor is subtle. More complicated models
account for differences in rates of transition and transversion substitutions,
and it immediately becomes apparent that the processes of both mutation and
substitution can be of any imagined degree of complexity.

Out  of  sequence  analyses  there  emerges  the  pleasing  generalization  that 
many  sequences  appear  to  diverge  at  an  approximately  clock-like  rate.  This 
molecular-clock  concept  should  be  interpreted  somewhat  loosely,  because 
rigorous  statistical  tests  have  identified  significant  irregularities  in  its  rate.  In 
addition,  there  are  dramatic  differences  in  rate  of  evolution  across  genes, 
because  the  neutral  substitution  rate  differs  from  one  gene  to  the  next.  Some 
lineages  appear  to  have  accelerated  or  decelerated  clock  rates,  and  one  cause 
for  the  variation  is  a change  in  generation  time  (for  example,  from  rodents  to 
primates). 

Synonymous  and  nonsynonymous  substitutions  have  different  effects  on 
the  protein  product,  so  estimating  the  rates  of  these  two  kinds  of  substitu- 
tion independently  can  be  informative  about  the  causes  of  evolutionary 
change.  For  example,  most  genes,  like  Drosophila  Adh,  have  a large  excess  of 
synonymous  changes,  an  observation  that  is  accounted  for  by  the  presumed 
deleterious effect of most amino acid replacements. Not all synonymous codons
are used with equal frequency, and the bias in codon usage implies that
even  synonymous  substitutions  may  not  be  selectively  neutral.  The  most  sen- 
sitive tests  for  selection  make  use  of  comparisons  between  intraspecific  poly- 
morphism and  interspecific  divergence.  Under  strict  neutrality,  these  two 
quantities  should  be  related  to  one  another,  and  departures  in  either  direction 
can  be  detected  through  heterogeneity  among  genes. 

The  neutral  theory  also  makes  predictions  about  the  shape  of  gene  trees, 
and  there  has  been  a great  deal  of  excitement  about  the  possibility  of  testing 
hypotheses  about  evolutionary  forces  based  on  inferred  gene  genealogies. 
(Problem  8.8  gives  one  example.)  Gene  trees  have  been  used  to  test  hypothe- 
ses about  selection,  recombination,  homogeneity  of  mutation,  and  even 
migration.  The  ability  to  account  for  the  patterns  of  correlation  built  up  by 
the  ancestral  history  of  genes  has  been  a major  advance  in  statistical  popula- 
tion genetics. 

Organelle  genome  evolution  occupies  an  important  position  in  the  devel- 
opment of  molecular  population  genetics,  in  part  because  of  the  numerous 
studies  of  mtDNA  and  cpDNA  variation.  Of  particular  interest  and  contro- 
versy was  the  work  on  human  mtDNA  variation,  which  raised  many  intrigu- 
ing problems  about  human  origins.  This  work  stimulated  a huge  amount  of 
theoretical  study  concerning  the  statistical  inferences  that  could  be  made 
from  sample  data,  including  times  of  common  ancestry,  inference  of  past 
demographic  histories,  and  so  forth.  Several  recent  studies  have  shown  that 
mtDNA exhibits patterns consistent with the past operation of natural selection,
in violation of many of these models.

Molecular phylogenetics seeks to reconstruct the ancestral history of
extant  organisms,  and  shares  many  analytical  procedures  with  molecular 
population  genetics.  There  are  several  widely  used  algorithms  for  recon- 
structing a tree  from  sequence  data,  and  we  examined  in  some  detail  the 
UPGMA  method,  least-squares,  neighbor-joining,  and  parsimony  methods. 
One  of  the  more  intriguing  patterns  of  variation  to  emerge  from  such  inter- 
specific comparisons  is  that  of  shared  polymorphism,  in  which  two  or  more 
species  share  a number  of  alleles  in  common.  It  is  unlikely  that  shared  poly- 
morphism would  be  maintained  for  long  by  chance,  so  it  is  not  surprising 
that  cases  of  shared  polymorphism  are  generally  found  in  genes  known  to  be 
under  strong  selection  or  in  species  that  have  recently  diverged. 

When  multiple  copies  of  similar  genes  exist  in  the  genome,  they  can 
exchange  sequences  through  unequal  recombination  and  gene  conversion. 
Such  exchanges  can  result  in  concerted  evolution,  a process  whereby  genes  in 
a multigene  family  are  very  similar  to  one  another  within  a species,  even 
though  the  duplication  events  that  gave  rise  to  the  family  occurred  far  in  the 
past.  Not  all  multigene  families  undergo  concerted  evolution.  A more  com- 
mon finding  is  that  many  multigene  families  exist  as  groups  of  genes  with 
related  function  that  have  diverged  enough  in  sequence  to  escape  gene  con- 
version. In  this  case,  new  genes  appear  by  duplication  and  old  ones  disap- 
pear by  deletion,  sometimes  preceded  by  inactivating  mutations  that 
generate  pseudogenes.  This  birth-and-death  process  gives  rise  to  complex 
patterns  of  relationships  among  genes  within  gene  families. 

PROBLEMS 

1.  Suppose  that  you  have  sequences  of  gene  A and  gene  B from  each  of  two 
species.  The  fraction  of  sites  that  differ  in  gene  A is  0.7  and  the  fraction  of 
sites  that  differ  in  gene  B is  0.05.  Apply  the  Jukes-Cantor  formula  to 
obtain  the  estimate  of  the  number  of  substitutions  per  site  for  each  gene. 
Which  gene  do  you  think  would  have  a smaller  estimate  of  variance  of 
substitution  rate?  Why? 

2.  Suppose  you  discover  a community  of  deep  sea  creatures  that  have  very 
unusual  DNA  that  has  not  four  bases  but  six.  Adenine  and  thymine  pair, 
and  guanine  and  cytosine  pair  just  like  most  DNA,  but  there  are  also  niti- 
dine  and  liondine,  which  also  pair.  You  obtain  sequences  from  two  of  these 
creatures  and  determine  that  20%  of  the  sites  mismatch  in  aligned 
sequences.  From  this  figure,  estimate  the  number  of  substitutions  per  site 
that  have  occurred  since  the  common  ancestor  of  the  two  species.  (Hint: 
You  know  that  the  number  is  higher  than  0.20,  because  back  mutations 
could  have  occurred.  Derive  an  expression  like  the  Jukes-Cantor  formula.) 

3. The following is a small portion of the gene coding for 6-phosphogluconate
dehydrogenase in two natural isolates of E. coli.

1 CTC ACC AAA ATC GCC GCC GTA GCT GAA GAC GGT GAA CCA TGC GTT ACC TAT ATT GGT GCC

2 CTG AAG CAG ATC GCG GCG GTT GCT GAA GAC GGT GAG CCG TGT GTG ACT TAT ATA GGT GCC

Infer  the  correct  translational  reading  frame  of  the  sequences  and  esti- 
mate: 

a.  the  number  of  amino  acid  differences/site. 

b.  the  number  of  nucleotide  differences/site. 

c.  the  number  of  nonsynonymous  substitutions  per  nonsynonymous  site 
(regarding codon sites 1 and 2 as nonsynonymous).

d.  the  number  of  synonymous  substitutions  per  synonymous  site 
(regarding  codon  site  3 as  synonymous). 

4.  In  the  human  immunodeficiency  virus  HIV,  which  causes  acquired 
immune  deficiency  syndrome  (AIDS),  the  rate  of  nucleotide  evolution 
has  been  estimated  at  about  0.01  substitutions  per  synonymous  site  per 
year.  Two  viruses  isolated  in  1983  in  Zaire  and  San  Francisco  differ  in 
approximately  one  third  of  their  synonymous  sites.  Estimate  the  year  in 
which the viruses last shared a common ancestor. (Data from Li et al.
1988.)

5.  The  data  below  give  the  proportion  of  nucleotide  sites  that  differ  in  a 
gene  in  four  RNA  viruses  (Yokoyama  et  al.  1988).  HIV1  and  HIV2  are  two 
rather  distinct  types  of  human  immunodeficiency  viruses,  VISNA  is  a 
lentivirus,  and  MMLV  is  a mouse  cancer-causing  virus.  Estimate  the 
number  of  nucleotide  substitutions  per  site  using  these  data.  What  do  the 
numbers  imply  about  the  evolutionary  relationships  among  the  viruses? 


          HIV2    VISNA   MMLV
HIV1      0.34    0.54    0.62
HIV2              0.52    0.63
VISNA                     0.63

6.  What  inference  would  you  make  regarding  the  selective  constraints  on  a 
region of DNA in which the rate of evolution was 5 × 10^-9 nucleotide
substitutions per site per year?

7.  What  might  you  infer  about  the  evolutionary  forces  affecting  a coding 
region  in  which  the  rate  of  amino  acid  replacement  was  greater  than  the 
rate  of  synonymous  nucleotide  substitution? 

8. Ribosomal RNA forms a complex secondary structure in which many
regions  of  the  molecules  are  folded  back  and  undergo  base  pairing  with 
complementary  nucleotide  sequences  elsewhere  in  the  same  molecule. 
What  pattern  of  nucleotide  sequence  evolution  might  be  expected  in 
these  paired  regions? 

9.  What  is  the  largest  value  of  d that  makes  sense  in  Equation  8.15  and  what 
does it mean?

10. If the rate of nucleotide evolution along a lineage is 0.5% per million
years,  what  is  the  rate  of  substitution  per  nucleotide  per  year?  What  is  the 
total  rate  of  divergence  of  two  lineages? 

11.  While  analyzing  the  DNA  sequences  of  two  copies  of  a gene,  you  find 
that  there  are  a total  of  34  synonymous  substitutions  and  16  nonsynony- 
mous  substitutions.  Using  the  method  of  Nei  and  Gojobori,  you  find  that 
there  were  310  synonymous  nucleotide  sites  and  633  nonsynonymous 
sites.  If  possible,  estimate  the  rates  of  synonymous  and  nonsynonymous 
substitution,  and  interpret  the  result. 

12.  If  the  effective  size  of  a diploid  population  is  N with  respect  to  autosomal 
genes,  what  is  it  with  respect  to 

a.  X-linked  genes? 

b.  Y-linked  genes? 

c.  mtDNA? 

13.  Analysis  of  mtDNA  in  humpback  whales  (Baker  et  al.,  1990,  Nature 
344:238-240)  has  shown  that  not  only  do  the  Atlantic  and  Pacific  popula- 
tions show  differences,  but  there  are  clear  geographic  subpopulations 
within  oceans  despite  the  lack  of  geographic  barriers.  Such  a pattern  may 
be  observed  if  either:  (1)  there  were  a low  rate  of  migration  and  a low  rate 
of  mtDNA  sequence  divergence,  or  (2)  a higher  rate  of  migration  with  a 
higher  rate  of  mtDNA  sequence  divergence.  Can  you  distinguish  these 
two  possibilities?  Can  you  separately  estimate  the  rate  of  neutral  muta- 
tion and  the  rate  of  migration  in  a subdivided  population? 

14. Suppose the phylogeny of five species is (((A,B)C)R(D,E)), where R
designates the root. Can you ascribe the substitution events of the following
data  uniquely  to  branches  on  this  genealogy?  Number  the  sites  1-10  and 
label  the  substitutions  by  site  number  on  the  tree. 


Species A    TAG  CTG  ATC  A
Species B    TAG  CCG  AGC  A
Species C    TAC  CCG  ATT  G
Species D    TAC  CCT  ATC  A
Species E    TGC  CCT  ATC  A

15. For an ideal population of effective size N, the average time to loss of a
new mutation destined to be lost is 2 ln(2N), and the average time to fixation
of a new mutation destined to be fixed is 4N. For what values of N
does

a. Fixation time = 10 × loss time?

b. Fixation time = 100 × loss time?

16. For the model of gene conversion with gene identities given in Equation
8.18, what value of λ makes the organization of the gene family irrelevant
in the sense that f = c1 = c2? What is the common value in this case? (λ is
the probability that a particular member of the gene family becomes
converted in any one generation.)

17. For the model of gene conversion with gene identities given in Equation
8.18, what are the values of f and c1 = c2 when λ = μ? (The equations
assume 4Nμ << 1.)

18. In a repetitive gene family being eliminated from the genome by deletion,
if the fraction of sequences present at time 0 that are still present at
time t equals exp(-ht), show that the half-life of the sequences equals
-ln(1/2)/h.

19.  For  a repetitive  gene  family  eliminated  as  described  in  Problem  18,  show 
that the average persistence of an element is 1/h.

Many important problems in evolutionary biology begin with
observations  of  phenotypic  variation.  Darwin  formulated  his 
ideas  about  evolution  by  natural  selection  based  on  observations 
of  phenotypic  variation.  He  struggled  for  many  years  to  explain  the  cause  of 
the  phenotypic  variability,  but  he  was  unsuccessful  at  one  level  because  he 
did  not  know  about  Mendelian  genetics.  Darwin  did,  however,  appreciate 
the  importance  of  the  observation  that  offspring  resemble  their  parents.  Con- 
tinuously varying  traits,  like  body  size,  are  influenced  by  both  genetic  and 
environmental  factors.  Crossing  experiments  demonstrate  that  the  genetic 
components  of  these  traits  are  not  determined  by  single  genes  because  the 
offspring  do  not  fall  into  discrete  classes  with  simple  Mendelian  ratios. 
Instead,  what  is  observed  is  a general  resemblance  between  parents  and  off- 
spring, suggesting  that  there  is  an  underlying  genetic  basis  to  the  trait,  but 
that  the  genetic  transmission  is  complex. 

A wealth of statistical tools have been developed for analyzing such
polygenic traits that do not show simple Mendelian transmission. These
approaches allow not only a description of the genetic basis of observed
phenotypic distributions, but they also provide a means of predicting the
distributions of phenotypes among offspring from observation of the parental
phenotypes. Most polygenic traits are influenced by the environment to
varying degrees, and they are often called multifactorial traits to emphasize
their determination by multiple genetic and environmental factors. For
example, variation in human weight is partly due to genetic differences
among individuals and partly due to environmental factors such as exercise
and level of nutrition. The study of polygenic inheritance goes beyond an
oversimplified nature-versus-nurture dichotomy because it is concerned with
specifying, in precise quantitative terms, the relative importance of nature,
nurture, and their interactions, in accounting for variation in phenotype
among individuals. Another compelling reason to study polygenic
inheritance is that natural selection occurs at the level of the composite phenotype,
and so fitness is a multifactorial trait.

Since natural selection operates on phenotypes, there arises an immediate
problem in understanding how phenotypic evolution is reflected in changes
that occur at the molecular level. One of the great challenges facing
population genetics is to unify the principles of molecular evolution with those
governing evolution at the phenotypic level.

TYPES  OF  QUANTITATIVE  TRAITS 

Multifactorial  traits  may  be  considered  as  resulting  from  the  combined 
effects  of  many  quantities,  some  genetic  in  origin  and  some  environmental, 
and  for  this  reason  they  are  often  called  quantitative  traits.  The  study  of 
quantitative  traits  constitutes  quantitative  genetics. 

Three  types  of  quantitative  traits  may  be  distinguished: 

1. Traits for which there is a continuum of possible phenotypes are
continuous traits; examples include height, weight, milk yield, and growth rate.
The distinguishing feature of continuous traits is that the phenotype can
take on any one of a continuous range of values. In theory, there are
infinitely many possible phenotypes, among which discrimination is limited
only by the precision of the instrument used for measurement. However,
in practice, similar phenotypes are often grouped together for purposes
of analysis.

2. Traits for which the phenotype is expressed in discrete, integral classes
are meristic traits; examples include number of offspring or litter size,
number of ears on a stalk of corn, number of petals on a flower, and
number of bristles on a fruit fly. The distinguishing feature of meristic traits is
that the phenotype of an individual is given by an integer that equals the
number of elements of the trait that the individual displays. For example,
a popular meristic trait used in experimental studies of quantitative
genetics in Drosophila is the number of bristles that occur on the
abdominal segments or sternites. Normally there are 14 to 24 bristles per sternite.
A male with 19 bristles on the fifth abdominal sternite therefore has a
phenotype of 19. The distribution of numbers of abdominal bristles in a
sample of Drosophila appears in Figure 9.1. When the number of possible
phenotypes of a meristic trait is large (as it is with abdominal bristle
number), the line between continuous traits and meristic traits
becomes indistinct.

Figure 9.1 Number of bristles on the fifth abdominal sternite in males of a
strain of Drosophila melanogaster. The smooth curve is that of a normal
distribution with mean 18.7 and standard deviation 2.1. (Data from T. Mackay.)

3. The third category of quantitative traits consists of discrete traits, which
are either present or absent in any one individual. In these cases, the
multiple genetic and environmental factors combine to determine an
underlying risk or liability toward the trait. Liability values are not
directly observable. However, an individual that actually expresses the
trait is assumed to have a liability value greater than some threshold or
triggering level. Traits of this type are called threshold traits, and
examples in human genetics include diabetes and schizophrenia. With
threshold traits, studies of affected individuals and their relatives permit
inferences to be made about the underlying values of liability. These
methods are discussed later in this chapter.
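
The threshold idea in item 3 can be made concrete with a small numerical sketch. It assumes, purely for illustration, that liability is normally distributed with mean 0 and standard deviation 1 and that the threshold is expressed on that same scale; the function names and example thresholds are hypothetical and do not come from the text.

```python
from statistics import NormalDist

def incidence_from_threshold(threshold, mean=0.0, sd=1.0):
    """Fraction of individuals whose liability exceeds the threshold.

    Assumes normally distributed liability (an illustrative assumption).
    """
    return 1.0 - NormalDist(mean, sd).cdf(threshold)

def threshold_from_incidence(incidence, mean=0.0, sd=1.0):
    """Inverse calculation: the threshold implied by an observed incidence."""
    return NormalDist(mean, sd).inv_cdf(1.0 - incidence)

if __name__ == "__main__":
    # Hypothetical thresholds, in standard deviations above the mean liability.
    for t in (1.0, 2.0, 3.0):
        print(f"threshold {t:.1f} -> incidence {incidence_from_threshold(t):.4f}")
```

Under these assumptions, a higher threshold corresponds to a rarer trait, and an observed incidence can be converted back into a threshold on the liability scale.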

Quantitative traits are of utmost importance to plant and animal breeders,
because agriculturally important characteristics such as yield of grain, egg
production, milk production, efficiency of food utilization by domesticated
animals, and meat quality are all quantitative traits. Even as modern methods
of genetic engineering are applied to animal and plant improvement,
quantitative genetics continues to play an important role because commercially
desirable traits result from complex interactions among many genes. In
addition to being essential ingredients in plant and animal improvement
programs, the principles of quantitative genetics, appropriately modified and
interpreted, can be applied to the analysis of quantitative traits in humans
and natural populations of plants and animals.


RESEMBLANCE  BETWEEN  RELATIVES  AND  THE 
CONCEPT  OF  HERITABILITY 

For Darwinian evolution to be possible, a necessary feature of the
transmission of traits is that offspring must tend to resemble their parents. Even
before the rediscovery of Mendel's work, Francis Galton was collecting
detailed statistical data on resemblance between parents and offspring
(Chapter 2). We will demonstrate the central ideas of the transmission of
quantitative traits, using some of the concepts that Galton developed. Then
we will show how models of Mendelian inheritance can account for these
features of hereditary transmission. Calculation of the degree of resemblance
among relatives in terms of underlying Mendelian genetics was first
provided by Fisher (1918). Fisher's paper, notoriously difficult, was of great
historical importance to population genetics, because it provided the first
demonstration that multiple Mendelian genes could account for the observed
patterns of transmission of multifactorial traits.

Figure 9.2 shows a plot of the mean of male offspring for a quantitative
trait (y values) against the phenotypic value of the father (x values),
displayed in the way Galton devised. The line is the best-fitting straight line,
called the regression line, of offspring on parent. Regression is relevant to
one of the primary aims in animal and plant breeding, namely to be able to
improve attributes of the stock. An essential part of genetic improvement is to
be able to predict what sort of offspring would be obtained from a given pair
of parents. For quantitative traits, prediction cannot be done exactly, but a
statistical description of the most likely offspring can be obtained by the
procedure of plotting the parent-offspring regression. For reasons that will
become clear in a moment, we are interested in the slope of the regression
line. The slope is most easily expressed in terms of the covariance of x and y,
defined as $\mathrm{Cov}(x,y) = [\sum (x - \bar{x})(y - \bar{y})]/n = \overline{xy} - \bar{x}\,\bar{y}$,
where the bar over a symbol means the average. This quantity is the sample
covariance of x and y. The slope of the line through a cluster of points having
the smallest summed squared vertical distance to the points is the regression
coefficient: $b = \mathrm{Cov}(x,y)/\mathrm{Var}(x)$. A related quantity that also arises in
quantitative genetics is the product-moment correlation coefficient, often simply
referred to as the correlation: $r = \mathrm{Cov}(x,y)/\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}$.

Figure 9.2 Mean weight of male pupae of the flour beetle Tribolium castaneum,
plotted against pupal weight of father (sire), in micrograms. Each point is the
mean of about eight male offspring. The regression coefficient of male offspring
weight on sire's weight is b = 0.11, and $h^2$ is estimated as 2b. (Courtesy of F.D. Enfield.)
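
The three quantities just defined can be computed directly from paired data. The following is a minimal sketch in plain Python that uses the divisor n, as in the definition of Cov(x,y) given in the text; the function names and the small data set are hypothetical and serve only as an illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Cov(x, y): the mean of (x - xbar)(y - ybar), using the divisor n."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / len(x)

def variance(x):
    """Var(x) is the covariance of x with itself."""
    return covariance(x, x)

def regression_coefficient(x, y):
    """Slope b of the regression of y on x: Cov(x, y) / Var(x)."""
    return covariance(x, y) / variance(x)

def correlation(x, y):
    """Product-moment correlation r: Cov(x, y) / sqrt(Var(x) Var(y))."""
    return covariance(x, y) / sqrt(variance(x) * variance(y))

if __name__ == "__main__":
    # Hypothetical parent (x) and offspring (y) values, for illustration only.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 2.9, 3.2, 4.8, 5.0]
    print("Cov =", covariance(x, y))
    print("b   =", regression_coefficient(x, y))
    print("r   =", correlation(x, y))
```

The same covariance function serves for the variance, which makes it easy to see that b and r differ only in how the covariance is scaled.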

An important concept in statistics is the distinction between parameters
and estimators. Descriptors that are calculated from a set of data to describe
a sample (such as the sample mean and sample variance) are considered as
estimates of the parameters that determine the true distribution. The sample is
thought of as having been drawn from some perfect distribution (whose
parameters we can never know), and the sample statistics give us a best
guess at what that true distribution is. In statistics, the distinction is
generally made by unadorned Greek symbols for parameters and circumflexes for
estimates. The usual symbols are: $\mu$ for the parametric mean, $\sigma^2$ for the
variance, $\sigma_{xy}$ for the covariance of x and y, and $\rho$ for the correlation. Using the
circumflex notation for estimates, $\hat{\mu}_x$ denotes the sample mean of x, so that
$\hat{\mu}_x = \bar{x}$. Similarly, $\hat{\sigma}_x^2 = \mathrm{Var}(x)$ and $\hat{\sigma}_{xy} = \mathrm{Cov}(x,y)$ are the sample estimates of
the variance of x and the covariance of x and y. When describing models of
quantitative genetics, it is the true distributions that are of interest, and so the
parameters are used. When describing the results of an experiment, it is more
appropriate to use the notation for estimates. The covariance and the
correlation coefficient are convenient measures of the degree of association between
x and y. If x and y are independent, then $\sigma_{xy}$ and $\rho$ are both zero. Since the
covariance between any two variables measures their degree of association,
the covariance may be positive or negative. Positive covariance means that
values of x and y tend to increase or decrease together; negative covariance
means that, as one variable increases, the other tends to decrease. The
limiting values of the covariance are $-\sigma_x\sigma_y$ on the negative side and $\sigma_x\sigma_y$ on the
positive. The limits are achieved only when the variables demonstrate a
perfect linear relationship with each other.

Returning now to Figure 9.2, if Cov(x,y) represents the covariance
between phenotypic values of fathers (sires) and those of their male
offspring, and Var(x) represents the variance of phenotypic values of the fathers,
then the slope of the regression line is equal to the regression coefficient,
Cov(x,y)/Var(x), which can be seen as follows. Suppose that the equation of
the line is represented as

$y = c + bx$    9.1

where c and b are constants, b being the slope. Taking means of both sides
yields

$\bar{y} = c + b\bar{x}$    9.2

Subtracting the second equation from the first yields

$y - \bar{y} = (c + bx) - (c + b\bar{x}) = b(x - \bar{x})$    9.3

Now multiply through by $x - \bar{x}$ to obtain

$(x - \bar{x})(y - \bar{y}) = b(x - \bar{x})^2$    9.4

Taking means of both sides produces

$\mathrm{Cov}(x,y) = b\,\mathrm{Var}(x)$    9.5

In other words, the slope b of the regression line equals

$b = \mathrm{Cov}(x,y)/\mathrm{Var}(x)$    9.6

As noted, the slope is called the regression coefficient of offspring on one
parent.
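
As a numerical check on Equation 9.6, the sketch below fits a line using b = Cov(x,y)/Var(x), takes the intercept c from Equation 9.2, and compares the summed squared residuals with those of slightly perturbed lines. The data values and function names are hypothetical; this is an illustration under those assumptions, not a general proof.

```python
def fit_line(x, y):
    """Return (c, b) with b = Cov(x, y) / Var(x) and c = ybar - b * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var = sum((xi - xbar) ** 2 for xi in x) / n
    b = cov / var          # Equation 9.6
    c = ybar - b * xbar    # Equation 9.2 rearranged
    return c, b

def sum_squared_residuals(x, y, c, b):
    """Summed squared vertical distance from the points to the line y = c + bx."""
    return sum((yi - (c + b * xi)) ** 2 for xi, yi in zip(x, y))

if __name__ == "__main__":
    # Hypothetical sire (x) and offspring-mean (y) values.
    x = [10.0, 12.0, 13.0, 15.0, 18.0, 20.0]
    y = [11.5, 11.9, 12.6, 12.4, 13.1, 13.8]
    c, b = fit_line(x, y)
    best = sum_squared_residuals(x, y, c, b)
    print(f"c = {c:.4f}, b = {b:.4f}, SSR = {best:.4f}")
    for db in (-0.05, 0.05):
        print(f"slope {b + db:.4f}: SSR = {sum_squared_residuals(x, y, c, b + db):.4f}")
```

Any change to the slope or the intercept increases the summed squared residuals, which is the sense in which the regression line is the best-fitting straight line.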

A graphical interpretation of regression is illustrated in Figure 9.3, which
shows the distribution, in two dimensions, of the variables x and y. The
variables may represent, for example, the phenotypic values of parents (x) and
offspring (y). When there is no association between x and y, the distribution
is a random scatter of points, and any line through the points fits equally
badly. Figure 9.3 shows the appearance of the scatter of points for different
values of association between the two variables. Note that, while each
parameter measures an aspect of association between x and y, the covariance, the
regression coefficient, and the correlation coefficient are different things. For
example, the covariance and the regression coefficient are unbounded,
whereas the correlation coefficient must be between -1 and 1.

Figure 9.3 Plots of random scatters of points having the same variance on the
x axis but a range of covariances. With zero covariance (top; b = 0), the regression
coefficient is zero. A stronger linear trend results in a higher regression coefficient.

Two extreme examples may help clarify parent-offspring regression. At
one extreme, if there were no genetic contribution to the trait, then the
scattergram might appear as a random scatter as in the top panel of Figure 9.3
with no tendency to follow a line. In such a case, knowing the phenotype of
the parents would not help to predict that of the offspring, because there
would be no parent-offspring resemblance. On the other hand, even with no
genetic variation, the points might nevertheless show a substantial tendency
to follow a line. To see why this is so, consider families living in different
environments. In favorable environments with plenty of food and resources,
parents and offspring might all be big and strong, while in unfavorable
environments, parents and offspring might be small and sickly. A
parent-offspring plot would show that big strong parents have big strong offspring,
while small sickly parents have small sickly offspring, even though there is
absolutely no genetic basis for the trait. The tendency of points to follow a
line in a parent-offspring scattergram tells us nothing about the genetic basis
of the trait, unless we are willing to make some claims (which hopefully can
be tested experimentally) about the environmental covariance (the tendency
of parents and offspring to resemble one another due to shared
environments). Only if there is no environmental covariance will the parent-offspring
regression indicate a degree of genetic influence on the resemblance. The
possibility of environmental covariance is absolutely critical in human
quantitative genetics, where the influence of shared environments can be very
subtle and very strong.

Assuming now that the environmental covariance is zero, the regression
coefficient b of offspring on one parent can be calculated for any
random-mating population, and it indicates the degree to which the variance
in the trait is determined by genetic variation. It is for this reason that the
regression coefficient is related to an important quantity in quantitative
genetics called heritability. There are two types of heritability that will be
distinguished shortly, but for now, we note that the "narrow-sense"
heritability ($h^2$) can be estimated from the relationship

$b = \tfrac{1}{2} h^2$    9.7

The factor 1/2 occurs in Equation 9.7 because the regression involves only a
single parent (the father, in the case of Figure 9.2), and only half of the genes
from any one parent are passed on to the offspring. In Figure 9.2, b = 0.11,
so $h^2$ = 0.22. Notice the considerable scatter among the points in the figure,
which represents data from 32 families. Because this sort of scatter is typical,
heritability estimates tend to be quite imprecise unless based on data from
several hundred families. Note, however, that even with an enormous sample,
there would be no less scatter to the points; we would merely have a more
accurate measure of how much scatter there is. One further point about
Figure 9.2: in organisms such as mammals, the regression is better performed
on the father's phenotype, rather than on the mother's, in order to avoid
potential bias in the estimate of heritability caused by such maternal effects
as intrauterine environment. In organisms where nurturing does not impart
significant maternal effects, scattergrams can be constructed with the x axis
being the average of the two parents (the midparent) and the y axis the
offspring phenotypes. From this sort of plot the regression coefficient is equal to
the heritability: in symbols, when the x axis is the midparent, $b = h^2$.
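
The two relationships between the regression coefficient and the narrow-sense heritability can be written as a small helper. This is only a sketch: the function name and the `parent` argument are hypothetical, and it relies on the same assumptions as the text (random mating and zero environmental covariance).

```python
def heritability_from_regression(b, parent="single"):
    """Convert a parent-offspring regression coefficient into an h^2 estimate.

    parent="single": offspring regressed on one parent, b = (1/2) h^2 (Equation 9.7)
    parent="mid":    offspring regressed on the midparent, b = h^2
    """
    if parent == "single":
        return 2.0 * b
    if parent == "mid":
        return b
    raise ValueError("parent must be 'single' or 'mid'")

if __name__ == "__main__":
    # Figure 9.2: regression of male offspring on sire gives b = 0.11.
    print(heritability_from_regression(0.11, parent="single"))
```

Keeping the type of regression explicit guards against the common mistake of reading a single-parent slope directly as a heritability.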


PROBLEM 9.1 This example of calculating $h^2$ from parent-offspring
regression uses data from Cook (1965), who studied shell breadth in
119 sibships of the snail Arianta arbustorum. For computational
convenience, the data have been grouped into six categories. Estimate the
heritability of shell breadth from these data.

Number of sibships   Midparent value (mm)   Offspring mean (mm)
22                   16.25                  17.73
31                   18.75                  19.15
48                   21.25                  20.73
11                   23.75                  22.84
4                    26.25                  23.75
3                    28.75                  25.42


ANSWER Letting x refer to the midparent value and y refer to the
offspring mean, and summing over all 119 sibships, $\bar{x} = 20.2626$,
$\bar{y} = 20.1786$, $\sum x_i^2 = 49{,}823.4375$, $\sum x_i y_i = 49{,}267.1875$,
$\hat{\sigma}_{xy} = 5.1826$, $\hat{\sigma}_x^2 = 8.1801$, and $b = h^2 = 0.63$. (In actual
practice we might not want to group the data into categories, because
there is some loss of accuracy from grouping. The regression
coefficient for the ungrouped data is b = 0.70. In addition, it should be
noted that there is substantial assortative mating for shell breadth,
and so the heritability estimate is artificially large.)
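
The calculation in Problem 9.1 can be reproduced with a short script. The sketch below weights each category by its number of sibships and uses the divisor n - 1 for the variance and covariance, which is an assumption chosen because it matches the printed intermediate values to about three decimal places; the variable and function names are hypothetical.

```python
# Grouped data from Problem 9.1 (Cook 1965).
counts = [22, 31, 48, 11, 4, 3]                          # number of sibships
midparent = [16.25, 18.75, 21.25, 23.75, 26.25, 28.75]   # x, in mm
offspring = [17.73, 19.15, 20.73, 22.84, 23.75, 25.42]   # y, in mm

def grouped_regression(n, x, y):
    """Regression of y on x for grouped data, each group weighted by its count.

    Returns (xbar, ybar, cov_xy, var_x, b), using the divisor (N - 1),
    where N is the total number of sibships.
    """
    total = sum(n)
    xbar = sum(ni * xi for ni, xi in zip(n, x)) / total
    ybar = sum(ni * yi for ni, yi in zip(n, y)) / total
    cov_xy = sum(ni * (xi - xbar) * (yi - ybar)
                 for ni, xi, yi in zip(n, x, y)) / (total - 1)
    var_x = sum(ni * (xi - xbar) ** 2 for ni, xi in zip(n, x)) / (total - 1)
    return xbar, ybar, cov_xy, var_x, cov_xy / var_x

if __name__ == "__main__":
    xbar, ybar, cov_xy, var_x, b = grouped_regression(counts, midparent, offspring)
    print(f"xbar = {xbar:.4f}, ybar = {ybar:.4f}")
    print(f"cov = {cov_xy:.4f}, var = {var_x:.4f}")
    print(f"b = h^2 = {b:.2f}")
```

Because the x axis is the midparent value, the slope is itself the heritability estimate. Small differences from the printed intermediate values in the last decimal places may reflect rounding.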


To this point we have shown that heritability can be used to measure the
degree of resemblance between parents and offspring. Although the
definition of heritability in terms of the regression coefficient between midparents
and offspring is reasonable, heritability defined in this manner is merely a
descriptive, empirical quantity because it makes no assumptions about
genetics. In the next section we show how heritability in this purely
statistical sense can be used to predict the result of artificial selection.


ARTIFICIAL  SELECTION  AND  REALIZED  HERITABILITY 

The deliberate choice of a select group of individuals to be used for breeding
constitutes artificial selection. The most common type of artificial selection
is directional selection, in which phenotypically superior animals or plants
are chosen for breeding. Although artificial selection has been practiced
successfully for thousands of years (for example, in the body size of
domesticated dogs), only during this century have the genetic principles underlying
its successes become clear. Understanding the genetic principles of artificial
selection permits prediction of the rapidity and amount by which a
population can be altered through artificial selection in any particular generation or
small number of generations. The theory of artificial selection is also
strongly motivated by the idea that natural selection may operate in a similar way.
For example, if only those individuals with greater than a certain amount of
body fat survive, or only those individuals with less than a critical rate of
evaporative water loss survive, then natural selection acts on the distribution
of phenotypes in much the same way that breeders select characters of
agricultural importance.

Artificial selection in outcrossing, genetically heterogeneous populations
is usually successful in that the mean phenotype of the population changes
over generations in the direction of selection (provided the population has
not previously been subjected to long-term artificial selection for the trait in
question). In experimental animals, the mean of almost any quantitative trait
can be altered in whatever direction desired by artificial selection. For
example, in Drosophila, body size, wing size, bristle number, growth rate, egg
production, insecticide resistance, and many other traits can be increased or
decreased by selection. In domesticated animals and plants, birth weight,
growth rate, milk production, egg production, grain yield, and countless
other traits respond to selection. Figure 9.4 shows the results of a long-term
selection program involving oil content in corn. Amazingly, the line selected
for high oil content is still responding after more than 90 generations (Dudley
and Lambert 1992).

The general success of artificial selection in outcrossing species indicates
that a wealth of genetic variation affecting quantitative traits exists. On the
other hand, in a genetically uniform population, the mean phenotype of the
population cannot usually be changed through artificial selection, because
genetic variation is required for progress under artificial selection. For
example, in experiments with the Princess bean, Johannsen (1909) found that
artificial selection consistently resulted in failure when practiced within
essentially homozygous lines. He obtained this result because, in genetically
homozygous populations, the only source of genetic variation comes from
new mutations. Because genetically variable populations usually
respond to artificial selection and genetically uniform populations do not,
the response to artificial selection might be used as a measure of the
extent of genetic variation in the trait. This notion of selection response
reflecting genetic variation will be formalized in the next section.

Figure 9.4 Results of a famous long-term experiment selecting for high and
low oil content in corn seeds. Begun in 1896, the experiment has the longest
duration of any on record and still continues at the University of Illinois. Note
the steady, linear rise in oil content shown by the upper curve. The lower curve
started on a roughly linear path and continued so for about ten generations, but
then the response tapered off, presumably because zero percent oil is an
absolute lower limit for the trait. (After Dudley and Lambert 1992.)

Prediction  Equation  for  Individual  Selection 

When individuals are selected for breeding based solely on their own
individual phenotypic values, the type of artificial selection is called individual
selection. Figure 9.5 illustrates a variety of individual selection called
truncation selection. The curve in panel A represents the normal distribution of
a quantitative trait in a population, and the shaded part of the distribution to
the right of the phenotypic value denoted T indicates those individuals
selected for breeding. The value T is called the truncation point. The mean
phenotype in the entire population is denoted $\mu$, and that of the selected
parents is denoted $\mu_S$. When the selected parents are mated at random, their
offspring have the phenotypic distribution shown in panel B, where the mean
phenotype is denoted $\mu'$.

Figure 9.5 Diagram of truncation selection. (A) Distribution of phenotypes in
the parental population, mean $\mu$. Individuals with phenotypes above the
truncation point (T) are saved for breeding the next generation. The selected parents
are denoted by the shading and their mean phenotype by $\mu_S$. (B) The mean of
the distribution of phenotypes in the progeny is denoted $\mu'$. Note that $\mu'$ is
greater than $\mu$ but less than $\mu_S$. The quantity S is called the selection differential,
and R is called the response to selection.
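
For a normally distributed trait, the proportion of individuals saved and the mean of the selected parents follow from the truncation point. The sketch below uses the standard result for the mean of a truncated normal distribution, which is not derived in the text; the function name and the numerical values (mean 100, standard deviation 10, truncation point 110) are hypothetical.

```python
from statistics import NormalDist

def truncation_selection(mu, sigma, T):
    """Proportion saved and mean of selected parents under truncation selection.

    Assumes the phenotype is normally distributed with mean mu and standard
    deviation sigma, and that all individuals above T are saved.
    Returns (proportion_saved, mean_of_selected, difference_from_mu).
    """
    std = NormalDist()                  # standard normal distribution
    z = (T - mu) / sigma                # truncation point in standard units
    saved = 1.0 - std.cdf(z)            # area under the curve to the right of T
    mu_s = mu + sigma * std.pdf(z) / saved   # mean of the upper tail
    return saved, mu_s, mu_s - mu

if __name__ == "__main__":
    saved, mu_s, diff = truncation_selection(mu=100.0, sigma=10.0, T=110.0)
    print(f"proportion saved = {saved:.3f}")
    print(f"mean of selected parents = {mu_s:.2f}")
    print(f"difference from population mean = {diff:.2f}")
```

Moving the truncation point to the right saves fewer individuals but raises the mean of the selected parents further above the population mean.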

An example of truncation selection for seed weight in edible beans is
shown in Figure 9.6. In this example, T = 650 mg, $\mu$ = 403.5 mg, $\mu_S$ = 691.7 mg,
and $\mu'$ = 609.1 mg. In this case, as is typical of truncation selection, the
offspring mean $\mu'$ is greater than the previous population mean $\mu$ but less than
the parental mean $\mu_S$. The reason $\mu'$ is greater than $\mu$ is that some of the
selected parents have favorable genotypes and therefore pass favorable genes
on to their offspring. At the same time, $\mu'$ is generally less than $\mu_S$ for two
reasons:

1. Because some of the selected parents do not have favorable genotypes;
rather, their exceptional phenotypes result from chance exposure to
exceptionally favorable environments.

2. Because alleles, not genotypes, are transmitted to the offspring, and
exceptionally favorable genotypes are disrupted by Mendelian
segregation and recombination.

Figure 9.6 Truncation selection experiment for seed weight (in milligrams) in
edible beans of the genus Phaseolus, laid out as in Figure 9.5. The truncation
point (T) is 650 mg. The selection differential S is the difference in means
between the selected parents and the whole population. The response R is the
difference in means between the progeny generation and the entire population
in the previous generation. The quantity R/S is the realized heritability.
(Data from Johannsen 1903.)

The difference in mean phenotype between the selected parents and the
entire parental population is the selection differential and is designated S.
In symbols,

$S = \mu_S - \mu$    9.8

The difference in mean phenotype between the progeny generation and the
previous generation is the response to selection and is designated R.
Symbolically,

$R = \mu' - \mu$    9.9

In quantitative genetics, any equation that defines the relationship
between the selection differential S and the response to selection R is known
as a prediction equation. Since selection can be applied to a population in
many different ways (others will be discussed later in this chapter), the
prediction equation may differ corresponding to the different modes of
selection. A general prediction equation that applies to many forms of selection,
including truncation selection (the type of selection illustrated in Figure 9.5),
is

$R = h^2 S$    9.10

where $h^2$ is the realized heritability. Later in this chapter, we will show that
the realized heritability is identical to the narrow-sense heritability defined
by regression, provided the phenotypes and the magnitudes of genetic
effects follow a bell-shaped Gaussian distribution. These assumptions are
necessary in order to apply regression to the problem. This equivalence
emphasizes again that heritability can be understood at several different
levels. Equation 9.10 implies that the realized heritability of a trait can be
interpreted as a mere description of what happens when artificial selection is
practiced. In Figure 9.6, for example, S = 288.2 and R = 205.6, so $h^2 = R/S$ =
205.6/288.2 = 71.3%. When estimated like this from empirical data, $h^2$ is the
realized heritability, and it simply summarizes the observed result.
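
Equations 9.8 through 9.10 translate directly into a few lines of code. The sketch below recomputes the realized heritability for the bean data of Figure 9.6; the function names are hypothetical, and `predicted_response` simply applies Equation 9.10 in the forward direction.

```python
def selection_differential(mu, mu_s):
    """S: mean of selected parents minus mean of the whole population (Eq. 9.8)."""
    return mu_s - mu

def response_to_selection(mu, mu_prime):
    """R: mean of progeny minus mean of the previous generation (Eq. 9.9)."""
    return mu_prime - mu

def realized_heritability(mu, mu_s, mu_prime):
    """h^2 = R / S, obtained by rearranging Equation 9.10."""
    return response_to_selection(mu, mu_prime) / selection_differential(mu, mu_s)

def predicted_response(h2, S):
    """R = h^2 * S (Equation 9.10)."""
    return h2 * S

if __name__ == "__main__":
    # Seed weights (mg) from Figure 9.6.
    mu, mu_s, mu_prime = 403.5, 691.7, 609.1
    S = selection_differential(mu, mu_s)
    R = response_to_selection(mu, mu_prime)
    print(f"S = {S:.1f} mg, R = {R:.1f} mg, h^2 = {realized_heritability(mu, mu_s, mu_prime):.3f}")
```

Used in the forward direction, an estimate of $h^2$ from one generation gives a predicted response for a planned selection differential in the next.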


PROBLEM 9.2 Below are data on the number i of sternital bristles in
samples from two consecutive generations $G_1$ and $G_2$ of an
experiment in directional selection for increased bristle number. In the $G_1$
generation, individuals with 22 or more bristles (enclosed in brackets)
were mated together at random to form the $G_2$ generation. Estimate
the realized heritability of the number of sternital bristles in this
experiment. (Data kindly provided by Trudy Mackay. In order to make the
sexes comparable, the value of 2 has been added to the bristle number
in males.)


[Table with columns i, $G_1$, and $G_2$; entries not recoverable.]

ANSWER Estimates of the means are $\mu$ = 2220/115 = 19.3, $\mu_S$ = 22.7, and $\mu'$
= 2035/101 = 20.1. The selection differential S = 22.7 - 19.3 = 3.4
(Equation 9.8) and the response R = 20.1 - 19.3 = 0.8 (Equation 9.9). The
realized heritability estimated from Equation 9.10 is $h^2$ = 0.8/3.4 = 0.235.


Data from experiments by Mackay (1985) demonstrate the potential
significance of new mutations in quantitative genetics. The base population on
which selection was performed was created by a cross that mobilizes the
transposable element P that results in new P-element insertions in the
germline and a syndrome of partial infertility and other reproductive
abnormalities known as hybrid dysgenesis. As a control, a genetically identical
base population was formed by the reciprocal cross, in which the P element
is not mobilized and hybrid dysgenesis does not occur. In the dysgenic cross,
the realized heritability in abdominal bristle number was increased by 40%
as compared with the nondysgenic control. More strikingly, the phenotypic
variance of bristle number in the selected dysgenic lines increased by a factor
of three over the course of eight generations. These results demonstrate that
the genetic variation affecting quantitative traits may even include insertions
of transposable elements. On the other hand, other comparable experiments
using hybrid dysgenesis have not given such dramatic results.

Selection  Limits 

Progress  under  artificial  selection  does  not  continue  forever.  Any  population 
must  eventually  reach  a selection  limit,  or  plateau,  after  which  it  no  longer 
responds to selection. One of the reasons that a population eventually
reaches a plateau is exhaustion of genetic variance, such that all alleles
affecting the selected trait have become fixed or lost, or are otherwise
unavailable for selection. With no genetic variance, no progress under
individual selection can be achieved. However, many experimental populations
that have
reached  a selection  limit  readily  respond  to  reverse  selection  (selection  in  the 
reverse  direction  of  that  originally  applied),  so  genetic  variance  affecting  the 
trait  is  still  present.  Indeed,  in  such  populations,  the  phenotype  may  change 
in  the  direction  of  its  original  value  if  continuing  artificial  selection  is  simply 
suspended  (relaxed  selection).  The  consequences  of  relaxed  selection  for  one 
example  in  Drosophila  are  illustrated  in  Figure  9.7. 

One  frequent  reason  for  the  occurrence  of  selection  limits  in  populations 
with considerable genetic variation is that artificial selection is opposed by
natural selection. In mice, for example, response to selection for small body size
ultimately ceases because small animals are less fertile than larger ones, and the
smallest animals are sterile (Falconer and Mackay 1996). Selection for small
body size gradually becomes less effective due to the opposing effects of
natural selection until, eventually, no further progress is possible. When selection is
relaxed,  the  natural  selection  is  unopposed  and  results  in  a retrogression  in  the 
artificially  selected  trait.  Some  backward  slippage  with  relaxed  selection  also 
results  from  diminution  in  the  linkage  disequilibrium  that  usually  builds  up 
during  the  course  of  long-term  artificial  selection.  If  natural  selection  opposes 
the  artificial  selection,  then  when  artificial  selection  is  relaxed,  natural  selection 
results  in  at  least  a partial  return  to  the  initial  phenotypic  mean. 


Figure 9.7  Response to selection for wind tunnel flight speed in Drosophila
melanogaster. One line was maintained without selection for 30 generations
starting at generation 65, and another was maintained without selection for 10
generations starting at generation 85 (triangles). In these examples, the flight
performance did not degrade after selection was relaxed. Apparently the selection
response occurred with little correlated response on fitness. (After Weber 1996.)
TABLE 9.1  SELECTION LIMITS AND DURATION OF RESPONSE FOR
VARIOUS TRAITS IN LABORATORY MICE

Character selected      Direction of   Total          Half-life of
                        selection      response (a)   response (b)

Weight (in strain N)    Up             3.4 $\sigma_P$   0.6 N
                        Down           5.6 $\sigma_P$   0.6 N
Weight (in strain Q)    Up             3.9 $\sigma_P$   0.2 N
                        Down           3.6 $\sigma_P$   0.4 N
Growth rate             Up             2.0 $\sigma_P$   0.3 N
                        Down           4.5 $\sigma_P$   0.5 N
Litter size             Up             1.2 $\sigma_P$   0.5 N
                        Down           0.5 $\sigma_P$   0.5 N

Source: From Falconer 1977.

(a) Total response is expressed as a multiple of the initial phenotypic standard deviation, $\sigma_P$.
(b) Half-life of response is the number of generations taken to progress halfway to the selection
limit; here the half-life is expressed in multiples of effective population number (N).


In most genetically heterogeneous populations, artificial selection can
change the phenotype well beyond the range of variation found in the
original population. Pertinent data for populations of mice are presented in Table
9.1. As can be seen, a total selection response of three to five times the
original phenotypic standard deviation is not unusual, and for selection to change
a population of effective size N halfway to its selection limit typically
requires about N/2 generations (that is, a half-life of about 0.5 N).

In some cases the total response to artificial selection is very large. For
example, in a long-term selection experiment for pupal weight in Tribolium,
in which the base population consisted of the progeny of a cross
between two inbred lines, 100 generations of selection resulted in a
population whose mean pupal weight was 17 standard deviation units greater
than the mean in the base population (Enfield 1980). The ability to select a
population in which virtually every phenotype is greater than the maximum in
the original population strikes many students as paradoxical. It does seem
plausible to argue that, if all of the alleles eventually selected are already
present in the original population, then all possible favorable genotypes
should be present also, though perhaps at low frequency. The fallacy in the
argument is that real populations subjected to artificial selection are
actually small in size, consisting of at most a few hundred organisms.
Therefore, if the favored alleles are rare, then the frequency of the favored
genotypes may be so small that the expected number of such genotypes will be
much smaller than one, and so the superior genotypes, while theoretically
possible, do not actually exist in the original population.
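
The size of this effect is easy to illustrate numerically. The sketch below rests on assumptions that are not in the text: n unlinked genes, each with a favored allele at the same frequency p, random mating, and no linkage disequilibrium, so that the genotype homozygous for the favored allele at every gene has frequency p^(2n). The function name and the numbers are hypothetical.

```python
def expected_count_best_genotype(n_loci, allele_freq, pop_size):
    """Expected number of individuals homozygous for the favored allele
    at every one of n_loci genes, assuming Hardy-Weinberg proportions and
    independence among genes (an illustrative assumption)."""
    genotype_freq = allele_freq ** (2 * n_loci)
    return pop_size * genotype_freq


# Hypothetical numbers: favored alleles at frequency 0.25 in a
# population of 500 organisms.
for n in (1, 5, 10, 20):
    count = expected_count_best_genotype(n, 0.25, 500)
    print(f"{n:2d} genes: expected number = {count:.3g}")
```

Under these assumptions, even a handful of genes makes the expected number of the best genotype far smaller than one, although every allele it requires is present in the population.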
Some traits consistently fail to respond to artificial selection, suggesting
a lack of suitable genetic variation. Bilateral symmetry is an example of a
trait that has not been amenable to change by artificial selection. The failure
of Maynard-Smith and Sondhi (1961) to create bilateral asymmetry in
Drosophila by selecting for an excess of dorsal bristles on the left side is
typical. The apparent lack of genetic variation determining bilateral asymmetry
is of interest in regard to embryonic development, for it implies that the
genetic control of development of symmetrical structures specifies patterns
that are common to the left and the right sides of the body. That is, rather
than left-bristle genes and right-bristle genes, there appear to be generic
bristle genes whose spatial expression is determined symmetrically. Of
course asymmetrical structures do exist (such as the vertebrate heart), and
recently inroads have been made in understanding the molecular genetic
basis for this asymmetry (Isaac et al. 1997). Genes that affect left-right
asymmetry do not do so in a continuous manner; rather, they either successfully
establish the asymmetry or they do not; absence of symmetry is fatal.

Not  all  traits  with  heritable  variation  obey  the  prediction  equation  and 
show  a simple  linear  change  in  the  mean.  Sometimes  a trait  responds  to 
directional  selection  for  a few  generations,  then  ceases  to  respond,  but  later 
responds  again  as  selection  is  continued.  One  possible  mechanism  for  this 
stop-and-start response is that the population at a plateau is in linkage
disequilibrium, and it takes time for recombination to break up the allelic
associations and release the latent genetic variation. This phenomenon was
observed  in  a long-term  study  of  the  quantitative  genetics  of  wing  veins  in 
Drosophila  (Scharloo  1987).  In  this  case  a bimodal  phenotypic  distribution 
was  also  generated  during  selection  (Figure  9.8),  which  was  proposed  to 
reflect  a nonlinear  mapping  from  genetic  and  environmental  factors  to  the 
determination  of  phenotype. 

As  we  have  seen,  heritability  can  be  interpreted  in  purely  statistical  terms 
with  no  genetic  content.  However,  if  we  postulate  that  there  are  Mendelian 
genes  underlying  the  phenotypes,  then  the  genetic  underpinning  allows  us 
to  do  more  than  merely  describe  statistical  relations  among  individuals.  By 
bringing  Mendelian  genetics  into  the  picture,  we  will  see  why  the  response  to 
any kind of artificial selection is determined by the magnitude of the
heritability. In particular, the genetic basis of response to artificial selection comes
from changes in gene frequencies and sometimes also from changes in linkage
disequilibrium.


GENETIC  MODELS  FOR  QUANTITATIVE  TRAITS 

When $h^2$ is interpreted as realized heritability, then Equation 9.10 is hardly a
"prediction equation" inasmuch as it merely describes what has already
happened in one generation of selection. Of course, the equation could be used
to predict the result of the next generation of selection, but artificial selection
is impossible in many natural populations and is time consuming and
expensive in many domesticated plants and animals. It would therefore be
useful if one could estimate heritability without actually performing any
artificial selection. If the heritability $h^2$ could be estimated in such a manner,
then Equation 9.10 would be a true prediction equation in the sense that the
response R could be predicted for any selection differential S, based on the
estimated value of $h^2$. Such an estimate of $h^2$ is indeed possible, but it
involves an understanding of heritability at a level that includes the
underlying genetic basis of quantitative traits.

Figure 9.8  Frequency distributions in females (left) and males (right) of a line
of Drosophila melanogaster selected for fourth wing vein length. The light lines
represent selection for a short vein, and black lines represent selection for a long
vein. In the line selected for long veins, both sexes displayed a bimodal
frequency distribution when the relative vein length was approximately 60-80%.
(From Scharloo 1987.)

An  understanding  of  the  genetics  behind  Equation  9.10  requires  three 
items:  (1)  a concept  of  how  alternative  alleles  of  a gene  affect  a quantitative 
trait;  (2)  a determination  of  how  selection  changes  the  allele  frequencies;  and 
(3)  a calculation  of  how  much  the  mean  of  the  trait  increases  as  a result  of  the 
change  in  allele  frequency.  Some  detail  is  required  to  establish  these  three 
items,  but  the  detail  is  necessary  in  order  to  understand  the  genetic  meaning 
of  heritability. 

Nilsson-Ehle (1909) was the first to show that a trait with a nearly
continuous distribution of phenotypes could result from the joint effects of several
genes. The trait of interest is the intensity of red pigment in the glume of
wheat (Triticum vulgare), which Nilsson-Ehle found to result from three
unlinked genes, each with two alleles. The situation is exceptionally simple
for a quantitative trait because the environment has a negligible effect on
phenotype, because the alleles of each gene are additive (i.e., heterozygotes
have a phenotype that is exactly intermediate between homozygous phenotypes),
and because the genetic effects are also additive across genes (i.e., the total
genetic effect of any three-locus genotype is just the sum of the separate effects of
each gene). To simplify matters, consider just two of the genes, and let their
alleles be denoted (A, a) and (B, b). With additivity within and across genes,
we may assume that the genotype aabb has a color score of 0 (white) and that
each A or B allele in the genotype contributes one unit of red pigment. Figure
9.9A shows the nine possible two-gene genotypes, their frequencies with
random mating when the allele frequencies of A and B are both 1/2, and the color
score of each genotype assuming additivity. The mean color score of the
population is 2. Indeed, when the allele frequencies of A and B are both p, then
the mean of a population with random mating can be shown to equal 4p. To
connect this trait with the prediction equation (Equation 9.10), suppose that the two
lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the next
generation. We wish to calculate $\mu_S$, $\mu'$, S, R, and $h^2 = R/S$. To do so, we
first find the allele frequency of A and B among the selected parents; then we
use the mean = 4p formula to obtain the mean of the offspring with random mating.

In this example, $\mu$ = 4(1/2) = 2 is given. The selected parents consist of
genotypes Aabb, aaBb, and aabb with respective frequencies 2/5, 2/5, and 1/5, and
the mean of parents = $\mu_S$ = (2/5)(1) + (2/5)(1) + (1/5)(0) = 4/5. The allele frequency
of A and B among parents = (1/2)(2/5) = 1/5, and therefore the mean among
offspring is $\mu'$ = 4(1/5) = 4/5. Then S = (4/5) - 2 = -6/5 and R = (4/5) - 2 = -6/5, so
$h^2$ = R/S = 1.0. As demonstrated in the next paragraph, this high heritability is
due to the additivity within and across genes and not merely to the fact that
environmental effects are negligible.

Figure 9.9  Frequencies of two-locus genotypes (outside circles) and respective
phenotypes (within circles) in a population with allele frequency 1/2 for each
locus. Panel A illustrates the case of additivity of effects at each locus and across
loci. In panel B, A and B are each dominant to a and b respectively, but the
effects of the two loci are additive.

Figure 9.9B refers to a hypothetical situation in which the A and B alleles
are dominant but still additive across genes. Thus, genotypes AA, Aa, BB,
and Bb each add one unit of red pigment to the phenotype. In this case, it
can be shown that the mean of a random-mating population with allele
frequencies of A and B both equal to p is given by 2p(1 + q), where q = 1 - p. With
p = 1/2, the population mean is therefore $\mu$ = 3/2. If
the two lowest phenotypic classes (i.e., 0 and 1) are selected as parents of the
next generation, then the mean of parents = $\mu_S$ = (1/7)(1) + (2/7)(1) + (1/7)(1) +
(2/7)(1) + (1/7)(0) = 6/7. The allele frequency of A and B among parents is
p = (1/7) + (1/2)(2/7) = 2/7, and the mean of the offspring is therefore $\mu'$ =
2(2/7)[1 + (5/7)] = 48/49. Thus, S = (6/7) - (3/2) = -9/14 and R = (48/49) - (3/2) = -51/98.
In the case where A and B are dominant, therefore, $h^2$ = R/S = 51/63 = 0.81. Although
environmental effects on seed color are still negligible in the dominance case, the
heritability has become less than 1.0. This perhaps surprising result occurs
because certain genetic effects (such as those resulting from dominance or,
in other examples, nonadditivity across genes) are not useful in changing a
population by means of the type of individual selection discussed here.
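
Both two-gene examples can be reproduced with exact fractions. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the code assumes, as the text does, random mating, independence between the two genes, equal allele frequencies at both genes, and selection of every individual with a color score of 0 or 1.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def genotype_table(p, score):
    """Return (n_A, n_B, frequency, phenotype) for the nine two-gene genotypes.

    n_A and n_B count copies of A and B (0, 1, or 2); score maps the number
    of copies at one gene to its contribution to the color score.
    """
    q = 1 - p
    hw = {0: q * q, 1: 2 * p * q, 2: p * p}   # Hardy-Weinberg proportions
    return [(n_a, n_b, hw[n_a] * hw[n_b], score(n_a) + score(n_b))
            for n_a, n_b in product(hw, repeat=2)]


def select_lowest(p, score, offspring_mean, max_phenotype=1):
    """Select all phenotypes <= max_phenotype; return a dict of results."""
    table = genotype_table(p, score)
    mu = sum(f * z for _, _, f, z in table)
    chosen = [row for row in table if row[3] <= max_phenotype]
    total = sum(f for _, _, f, _ in chosen)
    mu_s = sum(f * z for _, _, f, z in chosen) / total
    # Frequency of A among the selected parents (B is the same by symmetry).
    p_sel = sum(f * Fraction(n_a, 2) for n_a, _, f, _ in chosen) / total
    mu_next = offspring_mean(p_sel)
    s, r = mu_s - mu, mu_next - mu
    return {"mu": mu, "mu_S": mu_s, "p_sel": p_sel,
            "mu_next": mu_next, "S": s, "R": r, "h2": r / s}


# Figure 9.9A: each A or B allele adds one unit; offspring mean = 4p.
additive = select_lowest(HALF, lambda n: n, lambda p: 4 * p)
# Figure 9.9B: A and B dominant; offspring mean = 2p(1 + q) = 2p(2 - p).
dominant = select_lowest(HALF, lambda n: min(n, 1), lambda p: 2 * p * (2 - p))

for name, result in (("additive", additive), ("dominant", dominant)):
    print(name, {key: str(value) for key, value in result.items()})
```

Working in `Fraction` rather than floating point keeps intermediate values such as 48/49 and -51/98 exact, so they can be compared directly with the worked numbers in the text.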

To see how an underlying genetic model can be formulated for continuous
characters, refer to Figure 9.10, which shows the normal distribution of a trait
in a hypothetical random mating population. In truncation selection, all
individuals with phenotypes above the truncation point T are saved for breeding,
and the shaded area B of the distribution represents the proportion of the
population selected. (The total area under any normal density equals 1.) The
height of the normal density at the point T is denoted Z, and, as before, the
mean phenotype among the selected individuals is called $\mu_S$. One of the
special properties of the normal distribution to be used below is that

$$\frac{\mu_S - \mu}{\sigma^2} = \frac{Z}{B} \tag{9.11}$$
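
Equation 9.11 can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` for Z and B and a brute-force midpoint integration for the mean of the selected group; the function names and the parameter values (mean 100, standard deviation 10, truncation point 110) are hypothetical choices for illustration only.

```python
from statistics import NormalDist


def truncation_summary(mu, sigma, t):
    """Return (Z, B): height of the normal density at T and area to its right."""
    dist = NormalDist(mu, sigma)
    return dist.pdf(t), 1.0 - dist.cdf(t)


def selected_mean_numeric(mu, sigma, t, steps=20_000):
    """Mean phenotype above T, by midpoint integration out to mu + 12 sigma."""
    dist = NormalDist(mu, sigma)
    width = (mu + 12 * sigma - t) / steps
    area = moment = 0.0
    for i in range(steps):
        x = t + (i + 0.5) * width
        weight = dist.pdf(x) * width
        area += weight
        moment += x * weight
    return moment / area


mu, sigma, t = 100.0, 10.0, 110.0          # hypothetical values
z_height, b_area = truncation_summary(mu, sigma, t)
mu_s = selected_mean_numeric(mu, sigma, t)

left_side = (mu_s - mu) / sigma ** 2       # left-hand side of Equation 9.11
right_side = z_height / b_area             # right-hand side of Equation 9.11
print(f"(mu_S - mu)/sigma^2 = {left_side:.6f}, Z/B = {right_side:.6f}")
```

Note that Z here is the ordinate of the density of the trait itself, not of the standard normal curve, which is why the left-hand side is divided by the variance rather than the standard deviation.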

Figure 9.10  Normal distribution of a quantitative trait in a hypothetical
population, showing some important symbols used in quantitative genetics. Here $\mu$ is
the mean of the population, T the truncation point, Z the height (ordinate) of the
normal density at the point T, B is the shaded area under the normal curve to
the right of T, and $\mu_S$ is the mean among selected parents.

To determine the amount of increase in mean phenotype in a population
resulting from one generation of truncation selection, we first imagine a gene
that affects the trait in question and that has alleles A and A' with respective
allele frequencies p and q. Because of random mating, genotypes AA, AA',
and A'A' are present in the population with frequencies $p^2$, $2pq$, and $q^2$,
respectively, but the individual genotypes cannot be identified through their
phenotypic values because of the variation in phenotype caused by
environmental factors and genetic differences in other genes. If the genotypes could
be identified, their individual distributions of phenotypic value might appear
as shown in Figure 9.11. Each distribution is normal and has the same
variance, but the means are very slightly different. The mean phenotypes of AA,
AA', and A'A' genotypes are denoted $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, respectively.
The symbols a and d serve as convenient representations of the effects of the
alleles in question on the quantitative trait. The difference between means of
homozygotes is $(\mu^* + a) - (\mu^* - a) = 2a$, and d/a serves as a measure of
dominance. The relationship d = a means that A is dominant, d = 0 implies
additivity (heterozygotes exactly intermediate in phenotype between the
homozygotes), and d = -a means that A' is dominant. (Use of a and d in this
manner simplifies some of the subsequent formulas.) Calculation of a and d
for an actual example involving two alleles that affect coat coloration in
guinea pigs is illustrated in Table 9.2. In this case, a = 0.127, d = -0.016
(the negative sign on d means that the $c^d$ allele is partially dominant), and
d/a = -0.126.

Figure 9.11  Same distribution as in Figure 9.10, showing the slightly different
distribution of phenotypic value among the three genotypes (AA, AA', and
A'A') for a gene with two alleles that contributes to the quantitative trait. The
means of the distributions of AA, AA', and A'A' are symbolized $\mu^* + a$, $\mu^* + d$,
and $\mu^* - a$, respectively.

TABLE 9.2  CALCULATION OF $\mu^*$, a, AND d FOR ALLELES AT A LOCUS
AFFECTING COAT COLORATION IN GUINEA PIGS (a)

Genotype              Amount of black coloration (b)

$c^r c^r$ (AA)        1.202 = $\mu^* + a$ = 1.075 + 0.127
$c^r c^d$ (AA')       1.059 = $\mu^* + d$ = 1.075 - 0.016
$c^d c^d$ (A'A')      0.948 = $\mu^* - a$ = 1.075 - 0.127

$\mu^*$ = (1.202 + 0.948)/2 = 1.075
a = 1.202 - 1.075 = 0.127
d = 1.059 - 1.075 = -0.016

Source: Data from Wright 1968.

(a) The calculations to be carried out first are those beneath the data; then the right-hand
column is completed.

(b) Here the amount of black coloration is measured as arcsin($\sqrt{x}$), where x is the percentage of
black coloration on the animal. For $c^r c^r$, $c^r c^d$, and $c^d c^d$ genotypes, the corresponding x values are
87%, 76%, and 66%, respectively.

Assuming Hardy-Weinberg genotype frequencies, the mean
phenotype in the entire population is

$$\mu = p^2(\mu^* + a) + 2pq(\mu^* + d) + q^2(\mu^* - a) \tag{9.12}$$
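
The calculations of Table 9.2 and Equation 9.12 can be sketched as follows. The function names are hypothetical, and the allele frequency p = 0.5 used at the end is an arbitrary illustrative value, not one reported for the guinea pig data.

```python
def locus_parameters(mean_AA, mean_het, mean_aa):
    """Return (mu_star, a, d) from the genotype means of AA, AA', and A'A'."""
    mu_star = (mean_AA + mean_aa) / 2   # midpoint between the homozygotes
    a = mean_AA - mu_star               # half the homozygote difference
    d = mean_het - mu_star              # heterozygote deviation from midpoint
    return mu_star, a, d


def population_mean(p, mu_star, a, d):
    """Equation 9.12: mean phenotype with Hardy-Weinberg frequencies."""
    q = 1 - p
    return (p * p * (mu_star + a)
            + 2 * p * q * (mu_star + d)
            + q * q * (mu_star - a))


# Genotype means from Table 9.2 (arcsin square-root scale).
mu_star, a, d = locus_parameters(1.202, 1.059, 0.948)
print(f"mu* = {mu_star:.3f}, a = {a:.3f}, d = {d:.3f}, d/a = {d / a:.3f}")

# Hypothetical allele frequency, chosen only for illustration.
print(f"mean at p = 0.5: {population_mean(0.5, mu_star, a, d):.3f}")
```

Expanding Equation 9.12 gives the equivalent form mu* + a(p - q) + 2pqd, which is a convenient check on the function.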


PROBLEM 9.3  Crosses between the Danmark (P1) and Red Currant
(P2) tomato gave the following mean fruit weights and their log
transforms. P1 and P2 are the parental means, F1 and F2 are the first and
second hybrid generation, and B1 and B2 are the progeny of the backcross
of F1 x P1 and F1 x P2, respectively.

       Expected mean           Mean weight       Log (weight)

P1     $\mu + a$               10.36 ± 0.581      0.98 ± 0.03
P2     $\mu - a$                0.45 ± 0.017     -0.36 ± 0.02
F1     $\mu + d$                2.33 ± 0.130      0.33 ± 0.03
F2     $\mu + (1/2)d$           2.12 ± 0.105      0.27 ± 0.01
B1     $\mu + (1/2)(a + d)$     4.82 ± 0.253      0.64 ± 0.02
B2     $\mu + (1/2)(d - a)$     0.97 ± 0.045     -0.05 ± 0.01

Use this information to calculate $\mu$, a, and d for both the weights and
the log transformed weights. Do the simple weights or the log
transformed weights fit the model better? (Data from Powers 1951.)
ANSWER  The difference between the two parental means is 2a, so a =
(10.36 - 0.45)/2 = 4.96. This gives $\mu$ = 5.4. The F1 has a mean $\mu + d$ = 2.33,
so d = 2.33 - 5.4 = -3.07. The F2 should have mean (1/4)($\mu + a$) +
(1/2)($\mu + d$) + (1/4)($\mu - a$) = $\mu + (1/2)d$ = 5.4 + (1/2)(-3.07) = 3.86. The B1 refers to
backcrosses of the F1 to P1, which should yield one-half of genotypes
like P1 and one-half of genotypes like the F1, so the mean should be
(1/2)($\mu + a$) + (1/2)($\mu + d$) = $\mu + (1/2)(a + d)$ = 6.34. Similar reasoning gives an
expected mean for B2 of 1.38. The estimates for the means of the F2, B1,
and B2 do not fit very well at all. Trying again with the log
transformed data, we get a = 0.67, $\mu$ = 0.31, and d = 0.02. The expected
means for the F2, B1, and B2 are then 0.31 + (1/2)(0.02) = 0.32, 0.31 +
(1/2)(0.67 + 0.02) = 0.65, and 0.31 + (1/2)(0.02 - 0.67) = -0.01. The log
transformed data clearly fit much better, suggesting that the better scale to
use for the quantitative genetic models is the log transformed scale. In
actual practice, the entire set of data is used to estimate $\mu$, a, and d by
a method known as least squares, and the goodness of fit of the model
to the data can be tested by a chi-square test.
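
The comparison of scales in this answer can be scripted. The sketch below follows the simple approach of the answer (estimate the parameters from P1, P2, and F1 only, then predict F2, B1, and B2) rather than the least-squares fit mentioned at the end. Expressing the worst misfit as a fraction of the parental difference 2a is an assumption made here so that the two scales can be compared; it is not from the source. Because the code does not round intermediate values, its predictions can differ from the answer's in the last digit.

```python
def fit_from_parents_and_f1(p1, p2, f1):
    """Estimate (mu, a, d) from the two parental means and the F1 mean."""
    mu = (p1 + p2) / 2
    a = (p1 - p2) / 2
    d = f1 - mu
    return mu, a, d


def expected_means(mu, a, d):
    """Expected means of the F2 and the two backcross generations."""
    return {"F2": mu + d / 2,
            "B1": mu + (a + d) / 2,
            "B2": mu + (d - a) / 2}


# Observed means from the table in Problem 9.3.
OBSERVED = {
    "weight": {"P1": 10.36, "P2": 0.45, "F1": 2.33,
               "F2": 2.12, "B1": 4.82, "B2": 0.97},
    "log": {"P1": 0.98, "P2": -0.36, "F1": 0.33,
            "F2": 0.27, "B1": 0.64, "B2": -0.05},
}


def worst_relative_misfit(data):
    """Largest |predicted - observed|, as a fraction of P1 - P2 (= 2a)."""
    mu, a, d = fit_from_parents_and_f1(data["P1"], data["P2"], data["F1"])
    predicted = expected_means(mu, a, d)
    worst = max(abs(predicted[g] - data[g]) for g in predicted)
    return worst / (2 * a)


misfit = {scale: worst_relative_misfit(data) for scale, data in OBSERVED.items()}
for scale, value in misfit.items():
    print(f"{scale}: worst misfit = {value:.3f} of the parental difference")
```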


Effects of the scale of measurement are known as scaling effects. For
example, the a and d values in Table 9.2 are different when calculated for the
percent of black coloration x or for the arcsin($\sqrt{x}$) tabulated values. Since
estimates of the additive and dominance values of alleles depend on scaling,
so does the heritability. An important point is that the equivalence between
the heritability defined by parent-offspring regression and by realized
heritability depends on the correct choice of scaling. Only one scaling provides a
normal Gaussian distribution of phenotypes and of genetic effects, and that
is the appropriate scaling that yields the prediction Equation 9.10.

Change  in  Gene  Frequency 

Suppose for the moment that we were practicing artificial selection for
increased amount of black coat coloration in the guinea pigs in Table 9.2.
Selection for black coat coloration in a population containing both the $c^r$
(i.e., A) and $c^d$ (i.e., A') alleles would be successful in increasing the allele
frequency of A, and the average amount of black coloration among
individuals of the next generation would increase. Therefore, in order to
calculate the expected increase in black coloration in one generation of
selection, we must first calculate the corresponding change in the allele
frequency of A. An equation for change in allele frequency with natural
selection was derived in Chapter 6, which remains valid for artificial
selection if we agree to interpret the "fitness" of an individual as the
probability that the individual is included among the group selected as parents of
the next generation. With this interpretation of fitness, differences in
fitness (i.e., reproductive success) of AA, AA', and A'A' genotypes
correspond to the differences in area to the right of the truncation point in
Figure 9.11, because only those individuals in the shaded area are allowed
to reproduce. The differences in area are easy to calculate if you shift or
slide each curve horizontally until its mean coincides with $\mu^*$. The A'A'
curve must slide a units to the right, and the AA' and AA curves must slide
d and a units to the left. This shifting brings the distributions into
coincidence, but it slides the truncation points slightly out of register, as shown
in Figure 9.12. The difference in "fitness" between AA and AA', denoted
$w_{11} - w_{12}$ (as in Chapter 6), is equal to the small area indicated in Figure
9.12, as is the difference in fitness between AA' and A'A', denoted
$w_{12} - w_{22}$. The areas corresponding to $w_{11} - w_{12}$ and $w_{12} - w_{22}$ are approximately
rectangles, and the area of a rectangle is the product of the base and the
height. The approximation is most accurate when the effect of this one
locus on the phenotype is small. Therefore, since Z represents the height
of the normal distribution at the point T, we can make the following
approximations


$$\begin{aligned}
w_{11} - w_{12} &\approx Z[(T - d) - (T - a)] = Z(a - d) \\
w_{12} - w_{22} &\approx Z[(T + a) - (T - d)] = Z(a + d)
\end{aligned} \tag{9.13}$$


The average fitness $\bar{w}$ of the entire population simply equals B, because B is
the proportion of the population saved for breeding. From Chapter 6 we
know that

$$\Delta p = pq[p(w_{11} - w_{12}) + q(w_{12} - w_{22})]/\bar{w}$$

where $\Delta p$ is the change in frequency of the allele A in one generation of
selection. Substituting from Equation 9.13 and using $\bar{w} = B$ leads to

$$\Delta p = pq[pZ(a - d) + qZ(a + d)]/B \tag{9.14}$$

or, since p + q = 1,

$$\Delta p = (Z/B)pq[a + (q - p)d] \tag{9.15}$$

An equation corresponding to 9.15 could be obtained for any gene
affecting the trait, but the values of p, a, and d would differ for each gene. The
quantity in square brackets in Equation 9.15 is called the average excess. A
generalization that accounts for nonrandom mating is found in Falconer
(1985).
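
A quick numerical comparison shows how close the rectangle approximation comes to the exact areas. In this sketch the exact "fitnesses" are the normal tail areas beyond T for each genotype, and the approximate change uses Equation 9.15 with Z and B taken from a single normal curve centered on the midpoint between the homozygote means, as in Figure 9.12. The function names, the within-genotype standard deviation of 1.0, the truncation point of 2.0, and the allele frequency of 0.5 are all hypothetical; only a, d, and the midpoint come from Table 9.2.

```python
from statistics import NormalDist


def delta_p_exact(p, a, d, mu_star, sigma, t):
    """Change in allele frequency using exact tail areas as fitnesses."""
    q = 1 - p

    def saved(mean):
        # Proportion of a genotype lying above the truncation point T.
        return 1.0 - NormalDist(mean, sigma).cdf(t)

    w11, w12, w22 = saved(mu_star + a), saved(mu_star + d), saved(mu_star - a)
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / w_bar


def delta_p_approx(p, a, d, mu_star, sigma, t):
    """Equation 9.15, with Z and B taken from the curve centered on mu_star."""
    q = 1 - p
    curve = NormalDist(mu_star, sigma)
    z_height = curve.pdf(t)
    b_area = 1.0 - curve.cdf(t)
    return (z_height / b_area) * p * q * (a + (q - p) * d)


args = (0.5, 0.127, -0.016, 1.075, 1.0, 2.0)   # p, a, d, mu*, sigma, T
exact = delta_p_exact(*args)
approx = delta_p_approx(*args)
print(f"exact = {exact:.5f}, approximate = {approx:.5f}")
```

With these illustrative values the two results differ by less than 1 percent. Increasing a relative to sigma should make the agreement worse, in line with the remark that the approximation is most accurate when the effect of the locus is small.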
Figure 9.12  Same distribution as in Figures 9.10 and 9.11, but with the
distributions of AA, AA', and A'A' shifted laterally to coincide. Shifting the
distributions slides the truncation points slightly out of register, so the truncation points
for AA, AA', and A'A' become T - a, T - d, and T + a, respectively. The small area
that is denoted $w_{11} - w_{12}$ is the difference between the proportions of AA and
AA' genotypes that are included among the selected parents, and the area
$w_{12} - w_{22}$ is the difference in the proportion of AA' and A'A' genotypes included
among the selected parents. The large shaded area to the right of the truncation
point is B.


Genetic  Model  for  the  Change  in  Mean  Phenotype 

Equation 9.15 provides an expression for $\Delta p$ which can be used to calculate
the mean phenotypic value of coat color after one generation of selection. In
the next generation, the allele frequencies of A and A' are $p + \Delta p$ and $q - \Delta p$,
respectively. With random mating, the mean phenotype in this generation is
given by Equation 9.12 as

$$\mu' = (p + \Delta p)^2(\mu^* + a) + 2(p + \Delta p)(q - \Delta p)(\mu^* + d) + (q - \Delta p)^2(\mu^* - a) \tag{9.16}$$

When the right-hand side of this expression is multiplied out and terms in
$(\Delta p)^2$ are ignored because $\Delta p$ is usually small, then $\mu'$ is found to be
approximately

$$\mu' = \mu + 2[a + (q - p)d]\Delta p \tag{9.17}$$

The approximation in Equation 9.17 is rather good even for relatively large
values of $\Delta p$.
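
The quality of this approximation can be examined directly by comparing the exact mean in the next generation with the linear approximation. Multiplying out the exact expression shows that the only term dropped is $-2d(\Delta p)^2$, which the sketch below confirms numerically. The function names, the starting allele frequency of 0.5, and the value of 0.05 for the change in allele frequency are hypothetical; the values of a, d, and the homozygote midpoint are those of Table 9.2.

```python
def mean_phenotype(p, mu_star, a, d):
    """Equation 9.12: mean with Hardy-Weinberg genotype frequencies."""
    q = 1 - p
    return (p * p * (mu_star + a)
            + 2 * p * q * (mu_star + d)
            + q * q * (mu_star - a))


def next_mean_exact(p, dp, mu_star, a, d):
    """Equation 9.16: Equation 9.12 evaluated at allele frequency p + dp."""
    return mean_phenotype(p + dp, mu_star, a, d)


def next_mean_approx(p, dp, mu_star, a, d):
    """Equation 9.17: linear approximation, ignoring terms in dp squared."""
    q = 1 - p
    return mean_phenotype(p, mu_star, a, d) + 2 * (a + (q - p) * d) * dp


mu_star, a, d = 1.075, 0.127, -0.016     # Table 9.2
p, dp = 0.5, 0.05                        # hypothetical values

exact = next_mean_exact(p, dp, mu_star, a, d)
approx = next_mean_approx(p, dp, mu_star, a, d)
print(f"exact = {exact:.6f}, approximate = {approx:.6f}")
print(f"difference = {exact - approx:.2e}, -2d(dp)^2 = {-2 * d * dp ** 2:.2e}")
```

Because the neglected term is proportional to d, the approximation is exact for an additive gene (d = 0), whatever the size of the change in allele frequency.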
Equation 9.17 warrants a little more development since it yields the
prediction equation $R = h^2 S$ (Equation 9.10) and also provides an expression for
$h^2$ in terms of the parameters a, d, and p that can be interpreted genetically.
First, rewrite Equation 9.17 as

$$\mu' - \mu = 2[a + (q - p)d]\Delta p \tag{9.18}$$

Then substitute for $\Delta p$ from Equation 9.15, which yields

$$\mu' - \mu = (Z/B)\,2pq[a + (q - p)d]^2 \tag{9.19}$$

Now use the expression for Z/B given in Equation 9.11 to obtain

$$\mu' - \mu = (\mu_S - \mu)\,2pq[a + (q - p)d]^2/\sigma^2 \tag{9.20}$$

Finally, substitute from Equations 9.8 and 9.9 for the selection differential S
and the response R, yielding

$$R = (S)\,2pq[a + (q - p)d]^2/\sigma^2 \tag{9.21}$$

However, $R = h^2 S$ also (Equation 9.10), and so

$$h^2 = 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.22}$$

Equation 9.22 for $h^2$ is the one we were after, as it defines the heritability in
terms of p, q, a, and d, each of which has a genetic meaning.

Equation 9.22 is a valid approximation when a single gene affects the trait
in question, and when the effects of that gene are small. However, when
many genes affect the trait, the right-hand side of the equation must be
replaced by a summation of such terms, one for each gene. That is, for many
genes, $R = h^2 S$ where

$$h^2 = \sum 2pq[a + (q - p)d]^2/\sigma^2 \tag{9.23}$$

in which the summation is over all genes that affect the trait. (However, each
gene may have different values of a, d, p, and q.) As will be discussed in more
detail later, the quantity

$$\sigma_a^2 = \sum 2pq[a + (q - p)d]^2 \tag{9.24}$$

is called the additive genetic variance of the trait. Although the individual
components in the additive genetic variance are difficult to identify except in
contrived examples like the one involving guinea pigs, the collective effects
(represented by the summation) can be estimated.
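
Equations 9.23 and 9.24 translate directly into a short computation. The Python sketch below is illustrative only: the three loci and the phenotypic variance are hypothetical values invented for the example, and the function names are ours rather than anything defined in the text.

```python
def additive_variance(loci):
    """Equation 9.24: sum of 2pq[a + (q - p)d]^2 over all loci.

    `loci` is an iterable of (p, a, d) tuples; q is taken to be 1 - p.
    """
    total = 0.0
    for p, a, d in loci:
        q = 1.0 - p
        total += 2 * p * q * (a + (q - p) * d) ** 2
    return total


def narrow_sense_heritability(loci, phenotypic_variance):
    """Equation 9.23: additive variance divided by the phenotypic variance."""
    return additive_variance(loci) / phenotypic_variance


# Hypothetical loci (p, a, d), invented purely for illustration.
example_loci = [(0.2, 0.127, -0.016), (0.5, 0.10, 0.0), (0.7, 0.05, 0.05)]
sigma2_p = 0.05  # hypothetical total phenotypic variance

va = additive_variance(example_loci)
h2 = narrow_sense_heritability(example_loci, sigma2_p)
print(f"additive variance = {va:.5f}, h2 = {h2:.3f}")
```

Each locus contributes its own term, so loci with intermediate allele frequencies and large values of a dominate the sum.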

COMPONENTS  OF  PHENOTYPIC  VARIANCE 

As Equation 9.24 suggests, the variance of a quantitative trait can be split
into various components representing different causes of variation.
Similarity between relatives is conveniently expressed in terms of the
variance components, but variance partitioning is also of interest in its own right.
Since the rate of change of a trait under selection depends on the amount of
genetic variation affecting the trait, if there is no genetic variation, there is
obviously no response to selection. What is not so obvious is that some
components of genetic variation cannot be acted upon by some kinds of
selection. In other words, certain populations have ample genetic variation, yet
fail to respond to selection. The part of the genetic variation amenable to
selection is clarified by partitioning the variance.

Genetic  and  Environmental  Sources  of  Variation 

As shown in Table 9.3, the phenotypic value of any individual can be
represented as a sum of three components: (1) the mean $\mu$ of the entire population,
(2) a deviation from the population mean due to the specific genotype of the
individual in question (symbolized as $G_1$, $G_2$, and $G_3$ for AA, AA', and A'A'
genotypes, respectively), and (3) a deviation from the population mean due
to the specific microenvironment of the individual in question. (The
environmental deviations are unique to each individual and are represented as
$E_1, E_2, \ldots, E_9$.) These microenvironmental effects might be due to random
differences in nutrition, temperature, or other external factors, or they might
be seen even in an absolutely uniform external environment due to the
vagaries of embryonic development. It is important to note that the Gs and
Es are not directly observable. Nevertheless, as we shall see, the total
variance in phenotypic value can be partitioned into a component due to
variation among the Gs and another component due to variation among the Es.
The model can be summarized by writing

$$P = \mu + G + E \tag{9.25}$$

where P represents the phenotypic value of any individual and G and E are
the genotypic and environmental deviations pertaining to that individual.

TABLE 9.3  PHENOTYPES OF VARIOUS GENOTYPES AS THE SUM OF $\mu$, G, AND E^a

Genotype    Phenotypic value
AA          $\mu + G_1 + E_1$
AA          $\mu + G_1 + E_2$
AA          $\mu + G_1 + E_3$
AA'         $\mu + G_2 + E_4$
AA'         $\mu + G_2 + E_5$
AA'         $\mu + G_2 + E_6$
A'A'        $\mu + G_3 + E_7$
A'A'        $\mu + G_3 + E_8$
A'A'        $\mu + G_3 + E_9$

^a $\mu$ is the population mean. G is a contribution due to genotype,
different for each genotype. E is a contribution due to environment,
different for each individual.

To connect the above symbols with actual numbers, we may use Table 9.2
and assume an allele frequency of A of p = 0.2. Equation 9.12 then implies
that the mean of the population is $\mu = 0.994$. Thus, the respective $G_1$, $G_2$, and
$G_3$ deviations for AA, AA', and A'A' genotypes are

$G_1 = 1.202 - 0.994 = 0.208$
$G_2 = 1.059 - 0.994 = 0.065$
$G_3 = 0.948 - 0.994 = -0.046$

For a particular animal of genotype AA whose actual coat color score is, for
example, 1.312, the corresponding value of E for the animal would be
calculated using Equation 9.25 from the expression 1.312 = 0.994 + 0.208 + E; thus,
for this animal, E = 0.11. Similarly, a particular animal of genotype AA' with
an actual phenotype of P = 1.009 would have a value of E given by 1.009 =
0.994 + 0.065 + E, or E = -0.05. Because the E values are defined as deviations
from their mean, the average of Es for any genotype is 0. Likewise, since the
Gs are defined as deviations from their mean, the mean of the Gs is 0. This
result can be verified in the guinea pig example because

$$(0.2)^2 G_1 + 2(0.2)(0.8)G_2 + (0.8)^2 G_3 = 0$$

Equation 9.25 is appropriate when the effects of genotype and environment
are additive, that is, when the deviation of the phenotype of any particular
individual from the population mean $(P - \mu)$ can be written as the sum of an
effect resulting from the genotype of that individual and a separate effect
resulting from the environment of that individual.
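
The guinea pig numbers above can be reproduced in a few lines. This Python sketch is a minimal illustration under the assumptions already stated in the text (genotype means 1.202, 1.059, and 0.948, allele frequency p = 0.2, and Hardy-Weinberg proportions); the variable and function names are our own.

```python
# Genotype means for AA, AA', A'A' as used in the text (from Table 9.2).
means = {"AA": 1.202, "AA'": 1.059, "A'A'": 0.948}
p = 0.2
q = 1.0 - p
freqs = {"AA": p**2, "AA'": 2 * p * q, "A'A'": q**2}  # Hardy-Weinberg proportions

mu = sum(freqs[g] * means[g] for g in means)   # population mean
G = {g: means[g] - mu for g in means}          # genotypic deviations


def environmental_deviation(genotype, phenotype):
    """Solve P = mu + G + E (Equation 9.25) for E."""
    return phenotype - mu - G[genotype]


weighted_mean_G = sum(freqs[g] * G[g] for g in G)

print(f"mu = {mu:.3f}")
for g in G:
    print(f"G[{g}] = {G[g]:+.3f}")
print(f"E for an AA animal with P = 1.312: {environmental_deviation('AA', 1.312):+.2f}")
print(f"frequency-weighted mean of G: {weighted_mean_G:.1e}")
```

Because the Gs are computed from the unrounded mean, their frequency-weighted average is zero up to floating-point error, whereas the rounded values printed in the text sum to zero only approximately.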


PROBLEM 9.4 In Problem 9.3 the values of $\mu^*$, a, and d were found to
be 0.31, 0.67, and 0.02, respectively, for the logarithms of tomato
weight. Calculate the additive genetic variance in the F2 population,
the B1 population, and the B2 population.


ANSWER In the F2 population the allele frequency is p = q = 1/2, so
the formula for the additive genetic variance (Equation 9.24) is
$\sigma_a^2 = 2pqa^2 = \tfrac{1}{2}a^2 = 0.224$. In the backcross 1 population B1, the allele
frequencies are p = 3/4 and q = 1/4, so, applying Equation 9.24, we get
$\sigma_a^2 = 2(3/16)(0.66)^2 = 0.163$. The backcross 2 population B2 has allele frequencies
p = 1/4 and q = 3/4, so $\sigma_a^2 = 2(3/16)(0.68)^2 = 0.173$. When the dominance
parameter is so small, the additive variance is very nearly at a maximum when the
allele frequencies are both 1/2, and the graph of additive variance against allele
frequency is nearly symmetric.


To this point the discussion has been restricted to a particular population
in a single macroenvironment, and the sources of variation have been due to
genetic and microenvironmental differences among individuals. A change
in macroenvironment is easiest to see in an experimental setting, where, for
example, in macroenvironment 1 all of the guinea pigs get twice as much
food as in macroenvironment 2. Additivity of genetic and environmental
effects is true whenever the ratio of $G_1 : G_2 : G_3$ is the same in each of the
relevant environments. For the genotypes in Figure 9.13, for example, if the
actual range of environments is the range designated $E_1$, then the genetic and
environmental effects are additive because the ratio $G_1 : G_2 : G_3$ is the same for
any particular environment in $E_1$. For the same reason, the genetic and
environmental effects are additive if the actual range of environments is $E_2$.

Figure 9.13 The norm of reaction is the relation between the phenotype and the
environment, and this relation is known to vary from genotype to genotype.
Hypothetical norms of reaction for genotypes AA, AA', and A'A' are shown here.
In the range of environments denoted $E_1$, A is very nearly dominant to A' (that
is, AA and AA' have nearly the same phenotype). However, in the range $E_2$, A
and A' are very nearly additive (no dominance). The heritability of the trait
resulting from this gene differs according to whether the population is reared in
$E_1$ environments or $E_2$ environments. (Horizontal axis: environment.)

However, if the actual range of environments includes both $E_1$ and $E_2$,
then the ratio $G_1 : G_2 : G_3$ depends on the particular environment, and
therefore the genetic and environmental effects may not be additive.
Nonadditivity of genetic and environmental effects is called genotype-environment
interaction, and Equation 9.25 as written appears to assume that there is no
genotype-environment interaction. In actually estimating components of
variance, it is not necessary to make this assumption, because we can explicitly
examine the phenotypes when reared in different macroenvironments and
directly estimate the magnitude of the interaction. Alternatively, we can
arbitrarily define the environmental variance as also including the effects
of genotype-environment interaction.

When Equation 9.25 is valid, the total phenotypic variance $\sigma_p^2$ in the
population equals the mean of $(P - \mu)^2$. However, Equation 9.25 implies that
$(P - \mu)^2$ equals $(\mu + G + E - \mu)^2 = (G + E)^2$, and so, taking means over
the population,

$$\sigma_p^2 = \overline{(G + E)^2} = \overline{G^2} + 2\,\overline{GE} + \overline{E^2} \tag{9.26}$$

Because G and E are already deviations from their means, the mean of $G^2$
is the phenotypic variance in the population resulting from differences in
genotype, and the mean of $E^2$ is the phenotypic variance resulting from
differences in environment. The mean of $G^2$ is called the genotypic variance
and is denoted $\sigma_g^2$. The mean of $E^2$ is called the environmental variance and
is denoted $\sigma_e^2$. The remaining term, the mean of 2GE, is two times the
genotype-environment covariance. If the genotypic and environmental
deviations are uncorrelated (that is, if there is no systematic association
between genotype and environment), then there is said to be no
genotype-environment association and the mean of 2GE equals zero. When
there is no genotype-environment association, therefore,

$$\sigma_p^2 = \sigma_g^2 + \sigma_e^2 \tag{9.27}$$

Equation 9.27 is the theoretical foundation for partitioning the variance
into genetic and environmental effects. The assumption that
genotype-environment association is negligible is frequently a valid assumption in animal
and plant breeding where, because breeders have a degree of control not
available to, for example, human geneticists, experiments can be intentionally
designed in such a way as to minimize genotype-environment association.
However, genotype-environment association can occur even in animal and
plant breeding. For example, dairy farmers routinely provide more feed
supplements to cows that produce more milk; because milk-producing ability is
partly due to genotype, this feed regimen will provide superior
environments (better feed) to cows that have superior genotypes to begin with, so
there will be a genotype-environment association. Similarly, the best race
horses get the best trainers and the children of the best students often go to
the best schools. If one is not careful to correct for such associations,
genotype-environment association can inflate the apparent $\sigma_g^2$ and possibly give
spurious overestimates of heritability.
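
The role of the covariance term in Equation 9.26 can be seen in a toy calculation. The Python sketch below uses two small, entirely hypothetical sets of (G, E) deviations: in the first, every genotype meets every environment equally often, and in the second, favorable genotypes always receive favorable environments, as in the dairy example. The data and function names are ours and serve only as an illustration.

```python
from statistics import fmean, pvariance


def variance_components(pairs):
    """Return (var_P, var_G, var_E, cov_GE) for a list of (G, E) deviations.

    Population (divide-by-N) moments are used so that the identity
    var_P = var_G + var_E + 2 * cov_GE holds exactly.
    """
    gs = [g for g, _ in pairs]
    es = [e for _, e in pairs]
    ps = [g + e for g, e in pairs]          # P - mu = G + E (Equation 9.25)
    g_bar, e_bar = fmean(gs), fmean(es)
    cov_ge = fmean([(g - g_bar) * (e - e_bar) for g, e in pairs])
    return pvariance(ps), pvariance(gs), pvariance(es), cov_ge


# Hypothetical deviations: genotype and environment are unassociated.
no_association = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
# Hypothetical deviations: good genotypes always get good environments.
association = [(-1.0, -1.0), (-1.0, -1.0), (1.0, 1.0), (1.0, 1.0)]

for label, pairs in (("no association", no_association), ("association", association)):
    var_p, var_g, var_e, cov_ge = variance_components(pairs)
    print(f"{label}: var_P={var_p}, var_G={var_g}, var_E={var_e}, cov_GE={cov_ge}")
```

In the second data set the phenotypic variance exceeds the sum of the genotypic and environmental variances by twice the covariance, which is the excess that would be wrongly attributed to genotype if the association were ignored.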

The biological meaning of Equation 9.27 is shown for the alleles of one
gene in Figure 9.14. The solid curves represent the phenotypic distributions
in the genotypes AA, AA', and A'A' with means denoted $G_1$, $G_2$, and $G_3$, and
the dashed curve represents the phenotypic distribution in the entire
population. The total phenotypic variance $\sigma_p^2$ is the variance of the dashed
distribution; the genotypic variance $\sigma_g^2$ is the variance among the Gs (i.e.,
$\sigma_g^2 = p^2G_1^2 + 2pqG_2^2 + q^2G_3^2$, where p is the allele frequency of A); and the
environmental variance $\sigma_e^2$ is obtained by subtraction: $\sigma_e^2 = \sigma_p^2 - \sigma_g^2$. Although the Gs are not
generally known, $\sigma_g^2$ must equal zero in a genetically uniform population.
The observed variance of a randomly bred population, therefore, provides an
estimate of $\sigma_g^2 + \sigma_e^2$, whereas the average observed variance of genetically
uniform populations provides an estimate of $\sigma_e^2$. The estimate of $\sigma_g^2$ is obtained
by subtraction, as shown in an example using thorax length in Drosophila
(Table 9.4). In this case, genetic variation among individuals in the randomly
bred population accounts for about 0.180/0.366 = 49.2% of the phenotypic
variance. Genetically uniform populations such as inbred lines or crosses
between inbreds are not available in human populations, but identical twins
are often used instead because of the identical genotypes of the twins.

Figure 9.14 Phenotypic distribution (dashed curve) of a quantitative trait in a
hypothetical population, showing distributions (solid curves) of three
constituent genotypes for two alleles of a gene. The means of AA, AA', and A'A'
genotypes are denoted $G_1$, $G_2$, and $G_3$, respectively.

TABLE 9.4  CALCULATION OF GENOTYPIC VARIANCE ($\sigma_g^2$) AND
ENVIRONMENTAL VARIANCE ($\sigma_e^2$)^a

                 Populations
Variance         Random-bred                  Uniform
Theoretical      $\sigma_g^2 + \sigma_e^2$    $\sigma_e^2$
Observed         0.366                        0.186

$\sigma_e^2 = 0.186$
$\sigma_g^2 = (\sigma_g^2 + \sigma_e^2) - \sigma_e^2 = 0.366 - 0.186 = 0.180$

Source: Data from Robertson 1957.

^a Trait is length of thorax in Drosophila melanogaster (in units of $10^{-2}$ mm).
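
The subtraction in Table 9.4 is simple enough to do by hand, but writing it as a function makes the logic explicit. The Python sketch below is a minimal illustration using the two observed variances from the table; the function name and return layout are our own choices.

```python
def partition_variance(var_random_bred, var_uniform):
    """Split phenotypic variance following the logic of Table 9.4.

    var_random_bred estimates sigma_g^2 + sigma_e^2, and var_uniform
    estimates sigma_e^2 alone, so the genotypic variance is their difference.
    Returns (sigma_g^2, sigma_e^2, fraction of variance that is genotypic).
    """
    sigma2_e = var_uniform
    sigma2_g = var_random_bred - var_uniform
    return sigma2_g, sigma2_e, sigma2_g / var_random_bred


sigma2_g, sigma2_e, genetic_fraction = partition_variance(0.366, 0.186)
print(f"sigma_g^2 = {sigma2_g:.3f}, sigma_e^2 = {sigma2_e:.3f}, "
      f"genetic fraction = {genetic_fraction:.1%}")
```

The estimate is only as good as the assumption that the uniform and random-bred populations experience the same environmental variance.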

An example of a naturally occurring organism that exhibits remarkably
low levels of genetic variability is the African cheetah (O'Brien et al. 1983,
1987; May 1995). One might suppose that limited genetic variation would
result in depressed phenotypic variability as well, but a study of cranial
measures by Wayne et al. (1986) revealed that the amount of variability was not
appreciably less than that in three other large cats. In fact, there was a
significant increase in the amount of fluctuating asymmetry (that is, the difference
in measurements from the left and the right side of the body). The fluctuating
asymmetry is consistent with the notion that genetic homozygosity results in
reduced developmental stability, an idea that has considerable empirical
support but so far no good explanation in molecular terms. In any event,
reduction of genetic variance, and concomitant high homozygosity, can result
in an increase in phenotypic variance due to developmental instability. Since
extreme homozygosity may result in phenotypes that are very sensitive to
environmental fluctuations, the paradoxical increase in phenotypic variance
results from genotype-environment interaction.

Components  of  Genotypic  Variation 

So far, the phenotypic variance has been partitioned into the genotypic
variance and the environmental variance according to Equation 9.27.
The genotypic variance can be partitioned further into terms that are
particularly important for interpreting the resemblance between relatives.
The appropriate model is shown in Table 9.5, where the phenotypic means
of AA, AA', and A'A' are denoted by $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, as they
were earlier in Figure 9.11. To obtain the G values, the mean of each
genotype must be expressed as a deviation from the population mean, which is
$\mu = \mu^* + (p - q)a + 2pqd$, and the deviations are shown in the last column
of Table 9.5.

TABLE 9.5  EXPRESSIONS FOR POPULATION MEAN AND GENOTYPIC DEVIATIONS

Genotype   Frequency   Mean phenotype   Genotypic deviation from population mean (G)
AA         $p^2$       $\mu^* + a$      $G_1 = \mu^* + a - \mu = 2q[a + (q - p)d] - 2q^2d$
AA'        $2pq$       $\mu^* + d$      $G_2 = \mu^* + d - \mu = (q - p)[a + (q - p)d] + 2pqd$
A'A'       $q^2$       $\mu^* - a$      $G_3 = \mu^* - a - \mu = -2p[a + (q - p)d] - 2p^2d$

Population mean
$\mu = p^2(\mu^* + a) + 2pq(\mu^* + d) + q^2(\mu^* - a)$
$\;\; = (p^2 + 2pq + q^2)\mu^* + (p^2 - q^2)a + 2pqd$
$\;\; = (p + q)^2\mu^* + (p - q)(p + q)a + 2pqd$
$\;\; = \mu^* + (p - q)a + 2pqd$

The genotypic variance $\sigma_g^2$ is calculated as

$$\sigma_g^2 = p^2G_1^2 + 2pqG_2^2 + q^2G_3^2 = 2pq[a + (q - p)d]^2 + (2pqd)^2 \tag{9.28}$$

The first term in Equation 9.28 is the additive genetic variance $\sigma_a^2$
encountered earlier in Equation 9.24. The second term is a new quantity called the
dominance variance, which is symbolized $\sigma_d^2$. From Equation 9.28,
therefore,

$$\sigma_g^2 = \sigma_a^2 + \sigma_d^2 \tag{9.29}$$

which allows us to express the total phenotypic variance as the sum of three
terms, namely

$$\sigma_p^2 = \sigma_a^2 + \sigma_d^2 + \sigma_e^2 \tag{9.30}$$

When Equation 9.22 for heritability is written in terms of variance
components rather than p, q, a, and d, the equation implies that

$$h^2 = \sigma_a^2/\sigma_p^2 \tag{9.31}$$

Equation 9.31 is an important result because it states that the heritability
depends only on the additive genetic variance and not on the dominance
variance. Therefore, if all the genetic variance in a population results from
dominance variance (i.e., $\sigma_a^2 = 0$), then the population cannot respond to
individual selection because $h^2$ equals zero. To say the same thing in another way,
the dominance variance $\sigma_d^2$ represents that portion of the genetic variance that
is not acted upon by individual selection.
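
The algebra behind Equation 9.28 can be confirmed numerically. The Python sketch below is an illustration only: it computes the genotypic deviations directly from the genotype means and the population mean, forms the genotypic variance as the frequency-weighted mean of the squared deviations, and compares it with the sum of the additive and dominance terms. The parameter values are arbitrary choices of ours, and the function names are not from the text.

```python
def genotypic_deviations(p, a, d):
    """G1, G2, G3 for AA, AA', A'A' as deviations from the population mean."""
    q = 1.0 - p
    shift = (p - q) * a + 2 * p * q * d   # mu - mu*, from Table 9.5
    return a - shift, d - shift, -a - shift


def single_locus_components(p, a, d):
    """Return (sigma_g^2, sigma_a^2, sigma_d^2) for one locus with two alleles."""
    q = 1.0 - p
    g1, g2, g3 = genotypic_deviations(p, a, d)
    sigma2_g = p**2 * g1**2 + 2 * p * q * g2**2 + q**2 * g3**2  # weighted mean of G^2
    sigma2_a = 2 * p * q * (a + (q - p) * d) ** 2               # additive term
    sigma2_d = (2 * p * q * d) ** 2                             # dominance term
    return sigma2_g, sigma2_a, sigma2_d


# Complete dominance (a = d) at several allele frequencies.
for p in (0.1, 0.5, 0.9):
    sg, sa, sd = single_locus_components(p, a=0.1, d=0.1)
    print(f"p={p}: sigma_g^2={sg:.6f}, sigma_a^2 + sigma_d^2={sa + sd:.6f}")

# With a = 0 and p = 1/2, all of the genetic variance is dominance variance.
sg, sa, sd = single_locus_components(0.5, a=0.0, d=0.141)
print(f"a=0, p=0.5: sigma_a^2={sa:.6f}, sigma_d^2={sd:.6f}")
```

The last case is the situation described above in which a population has genetic variance but a narrow-sense heritability of zero.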

Equation 9.31 means that the heritability of a trait is the ratio of the
additive genetic variance to the total phenotypic variance. Sometimes the word
heritability is used in reference to a different variance ratio, namely the ratio of
the total genotypic variance to the total phenotypic variance (i.e., $\sigma_g^2/\sigma_p^2$). To
avoid confusion, quantitative geneticists distinguish the two types of
heritability as follows:

1. The ratio $\sigma_a^2/\sigma_p^2$ is called heritability in the narrow sense. (This is the
variance ratio we have been using all along.)

2. The ratio $\sigma_g^2/\sigma_p^2$ is called heritability in the broad sense.

Generally speaking, narrow-sense heritability is the more important with
individual selection (or any mode of selection that capitalizes primarily on
the additive genetic variance), whereas broad-sense heritability is the more
important when selection is practiced among clones (a clone is a group of
genetically identical individuals), inbred lines, or varieties. We use the term
heritability to mean narrow-sense heritability unless otherwise stated.

As emphasized earlier, heritability has no transparent interpretation in
simple genetic terms. The same is true of the variance components $\sigma_a^2$ and $\sigma_d^2$.
Even for a single gene, the variance components depend on the particular
values of a and d (Figure 9.15), and of course the estimates of heritability
must also depend on allele frequency (Figure 9.16). With many genes that act
together, $\sigma_a^2$ is defined as a summation of the values of $2pq[a + (q - p)d]^2$ for
each gene affecting the trait, and $\sigma_d^2$ represents a summation of the values of
$(2pqd)^2$ for each gene. Furthermore, when the trait is affected by multiple
genes, the formula for $\sigma_g^2$ in Equation 9.29 must be extended to include an
additional term that pertains to interaction among the genes. This interaction
term is called the interaction variance or the epistatic variance and is
symbolized $\sigma_i^2$. With the interaction variance included, Equation 9.29 becomes

$$\sigma_g^2 = \sigma_a^2 + \sigma_d^2 + \sigma_i^2 \tag{9.32}$$

The  important  point  to  remember  about  the  components  of  genotypic 
variance  is  that  they  represent  the  cumulative,  statistical  effects  of  all  genes 
affecting  the  trait.  Few  inferences  about  the  actual  mode  of  inheritance  of  the 
trait  are  possible  from  the  variance  components,  particularly  concerning  the 
number  of  genes  involved  and  their  individual  effects. 


PROBLEM 9.5 By definition, a simple Mendelian trait is one that is
determined entirely by genotype in the prevailing environment.
Therefore, $\sigma_e^2 = 0$ in Equation 9.27, and the broad-sense heritability
$\sigma_g^2/\sigma_p^2 = 1$. Show that, for a simple Mendelian recessive, the
narrow-sense heritability equals $2q/(1 + q)$, where q is the recessive
allele frequency.




Figure 9.15 Total genetic variance ($\sigma_g^2$), additive genetic variance ($\sigma_a^2$),
and dominance variance ($\sigma_d^2$) for a locus with two alleles (A and A') plotted
against the frequency of allele A (p). The mean phenotypes of AA, AA', and A'A'
are denoted $\mu^* + a$, $\mu^* + d$, and $\mu^* - a$, respectively. In all cases, we have that
$\sigma_a^2 = 2pq[a + (q - p)d]^2$, $\sigma_d^2 = (2pqd)^2$, and $\sigma_g^2 = \sigma_a^2 + \sigma_d^2$.
(A) a = d = 0.0707 (A dominant to A'); (B) a = 0.1, d = 0 (no dominance);
(C) d = -a = 0.0707 (A' dominant to A); (D) a = 0, d = 0.141 (overdominance).
For ease of comparison, the values of a and d have been chosen to make the
maximum of $\sigma_g^2$ equal to 0.005 in each case.


Figure 9.16 Narrow-sense heritability due to a single locus with two alleles (A
and A') as a function of p, the allele frequency of A. In general, for one locus,
$h^2 = 2pq[a + (q - p)d]^2/\sigma_p^2$, where $\sigma_p^2$ is the total phenotypic variance. The curves
correspond to a = 0.1, and d = 0.1 (A dominant), d = 0 (no dominance), and d =
-0.1 (A' dominant).


ANSWER Let the phenotypes of AA, AA', and A'A' be assigned
phenotypic values 0, 0, and 1, respectively, so that the A' allele is recessive.
In this case, $\mu^* = 1/2$, $a = -1/2$, and $d = -1/2$. The numerator of
Equation 9.31 is the additive genetic variance. Because
$a + (q - p)d = -\tfrac{1}{2} - \tfrac{1}{2}(q - p) = -q$, the additive genetic variance
equals $2pq \cdot q^2 = 2pq^3$. The mean phenotype equals $q^2$, and the variance in
phenotypic value is $q^2 - (q^2)^2 = q^2(1 - q^2) = q^2(1 + q)(1 - q) = pq^2(1 + q)$.
The heritability is the additive variance divided by the phenotypic variance, namely
$2pq^3/[pq^2(1 + q)] = 2q/(1 + q)$. When the autosomal recessive trait is
rare, $q \approx 0$, and the heritability is approximately equal to the frequency
of heterozygous carriers.
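
The result of Problem 9.5 can be checked with exact rational arithmetic. The Python sketch below is a small illustration under the assumptions of the problem (phenotypes scored 0, 0, and 1, no environmental variance, Hardy-Weinberg proportions); the function name is ours.

```python
from fractions import Fraction


def recessive_heritability(q):
    """Narrow-sense heritability for a fully penetrant recessive trait.

    Phenotypes of AA, AA', A'A' are scored 0, 0, 1, so mu* = 1/2 and
    a = d = -1/2, with no environmental variance (as in Problem 9.5).
    """
    p = 1 - q
    a = d = Fraction(-1, 2)
    additive = 2 * p * q * (a + (q - p) * d) ** 2   # Equation 9.24, one locus
    mean = q ** 2                                   # only A'A' scores 1
    phenotypic = mean - mean ** 2                   # variance of a 0/1 trait
    return additive / phenotypic


for q in (Fraction(1, 100), Fraction(1, 10), Fraction(1, 2)):
    h2 = recessive_heritability(q)
    print(f"q = {q}: h2 = {h2}, closed form 2q/(1+q) = {2 * q / (1 + q)}")
```

Using `Fraction` avoids rounding, so the computed heritability and the closed form agree exactly rather than only to within floating-point error.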


COVARIANCE  BETWEEN  RELATIVES 

Components of genetic variation are important because they may be used to
express the phenotypic covariance between relatives. Since the distribution
of offspring from a given parental genotype depends on the distribution of
potential mates, the variance components and estimates of heritability
depend not only on allele frequencies but also on the distribution of
genotype frequencies. To simplify things, we assume that the trait is determined
by one gene with two alleles, and that the population is in Hardy-Weinberg
proportions. However, the same results are also true for many genes when
the trait is determined by summing the individual allelic effects, provided
that the population is in multilocus linkage equilibrium.

Table 9.6 displays three genotypes of parents, their genotypic value, and
the mean genotypic value of the offspring with random mating. The
covariance of the offspring and one parent is calculated by summing the product of
the last three columns of Table 9.6 and subtracting the product of the means
of the last two columns. After tedious algebra, the covariance of offspring and
parent is:

$$\sigma_{OP} = pq[a + (q - p)d]^2 = \tfrac{1}{2}\sigma_a^2 \tag{9.33}$$

This is a remarkably simple result because it says that the covariance
in phenotype of parents and offspring is one-half the additive genetic
variance. No component of dominance inflates the covariance in this case.
Since environmental effects are assumed to be random with respect to
genotypic values (there is no genotype-environment correlation),
environmental effects also play no role in parent-offspring covariance. We must,
however, assume that the environments of parents and offspring are
uncorrelated for Equation 9.33 to be valid. In order to see the relation between
narrow-sense heritability and regression, recall that the regression
coefficient is defined as

$$b = \sigma_{xy}/\sigma_x^2 \tag{9.34}$$

In this case, the regression of offspring on one parent is, from Equation 9.33:

$$b_{OP} = \sigma_{OP}/\sigma_p^2 = \tfrac{1}{2}\sigma_a^2/\sigma_p^2 = \tfrac{1}{2}h^2 \tag{9.35}$$

In the regression of offspring on the midparent (that is, the average value of
the parents), the denominator in Equation 9.35 becomes $\tfrac{1}{2}\sigma_p^2$ because this is
the variance of the mean of the two parents, assuming random mating.
Hence the regression coefficient equals $(\tfrac{1}{2}\sigma_a^2)/(\tfrac{1}{2}\sigma_p^2)$, and so the regression
coefficient of offspring on midparent equals the narrow-sense heritability.

TABLE 9.6  DERIVATION OF PARENT-OFFSPRING COVARIANCE

Parent's genotype   Frequency   Genotypic value^a          Offspring mean genotypic value
AA                  $p^2$       $2q(a - pd)$               $aq + dq(q - p)$
AA'                 $2pq$       $a(q - p) + d(1 - 2pq)$    $\tfrac{1}{2}a(q - p) + \tfrac{1}{2}d(q - p)^2$
A'A'                $q^2$       $-2p(a + qd)$              $-ap - dp(q - p)$

^a Genotypic values are expressed as deviations from the population mean.
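
The "tedious algebra" leading to Equation 9.33 can be sidestepped with a numerical check. The Python sketch below is an illustration only: it builds the parent-offspring covariance from the three rows of Table 9.6 and compares it with half the additive genetic variance. The heterozygote's offspring mean is written as $\tfrac{1}{2}a(q - p) + \tfrac{1}{2}d(q - p)^2$, which is the value implied by random mating; the parameter sets and function names are our own choices.

```python
def parent_offspring_covariance(p, a, d):
    """Covariance of one parent with its offspring, built from Table 9.6."""
    q = 1.0 - p
    rows = [
        # (frequency, parent genotypic value, offspring mean genotypic value)
        (p**2, 2 * q * (a - p * d), a * q + d * q * (q - p)),
        (2 * p * q, a * (q - p) + d * (1 - 2 * p * q),
         0.5 * a * (q - p) + 0.5 * d * (q - p) ** 2),
        (q**2, -2 * p * (a + q * d), -a * p - d * p * (q - p)),
    ]
    mean_parent = sum(f * g for f, g, _ in rows)
    mean_offspring = sum(f * o for f, _, o in rows)
    mean_product = sum(f * g * o for f, g, o in rows)
    return mean_product - mean_parent * mean_offspring


def additive_variance(p, a, d):
    """Single-locus additive genetic variance (Equation 9.24)."""
    q = 1.0 - p
    return 2 * p * q * (a + (q - p) * d) ** 2


for p, a, d in [(0.2, 0.127, -0.016), (0.5, 0.1, 0.1), (0.9, 0.1, -0.1)]:
    cov = parent_offspring_covariance(p, a, d)
    half_va = additive_variance(p, a, d) / 2
    print(f"p={p}, a={a}, d={d}: cov_OP={cov:.6f}, half sigma_a^2={half_va:.6f}")
```

The agreement holds for any value of d, which is the numerical counterpart of the statement that dominance does not contribute to the parent-offspring covariance.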

The same reasoning can be followed to obtain covariances between other
pairs of relatives, as summarized in Table 9.7. As can be seen, the additive
genetic variance can be estimated directly either from parent-offspring
covariance or from half-sib covariance. However, full-sib covariance includes
a term resulting from dominance. The expressions in Table 9.7 are correct as
long as there are no complications such as genotype-environment
associations or other nonrandom environmental effects such as full sibs sharing
environmental factors common to the whole family but not shared by other
families. Since the total variance in phenotypic value $\sigma_p^2$ can be estimated
directly, once $\sigma_a^2$ is estimated from the covariance between relatives, the
narrow-sense heritability can be estimated from Equation 9.31. The first three
relationships in Table 9.7 are the most useful in quantitative genetics and are
commonly used in animal and plant breeding. The other relationships are
used mainly in human quantitative genetics.
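
As a concrete illustration of estimating heritability from relatives, the regression of offspring on midparent can be computed from family data. The Python sketch below uses the six midparent values and full-sunlight offspring values given in Problem 9.6 of this section; the function name is ours, and the calculation is simply the sample covariance divided by the sample variance of the midparents.

```python
def regression_slope(x, y):
    """Least-squares slope of y on x: sample covariance over sample variance of x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    var_x = (sum(v * v for v in x) - sum_x**2 / n) / (n - 1)
    cov_xy = (sum(u * v for u, v in zip(x, y)) - sum_x * sum_y / n) / (n - 1)
    return cov_xy / var_x


# Midparent and full-sunlight offspring heights from Problem 9.6.
midparent = [1.10, 1.54, 1.23, 1.06, 1.47, 1.38]
offspring = [1.12, 1.53, 1.22, 1.04, 1.43, 1.32]

h2_estimate = regression_slope(midparent, offspring)
print(f"estimated narrow-sense heritability = {h2_estimate:.2f}")
```

With only six families the estimate has a large sampling error, so a value like this should be read as illustrative rather than precise.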

The genetic covariance between various relatives can also be derived
using the concepts of gene identity developed by Cotterman (1940) and
extended by Crow and Kimura (1970). In these terms, the generalized
covariance for a pair of related individuals is

$$\mathrm{Cov}(x, y) = r\sigma_a^2 + u\sigma_d^2 \tag{9.36}$$

where the coefficients r and u are determined from coefficients of coancestry.
The coefficient of coancestry, $F_{xy}$, of two individuals x and y is the inbreeding
coefficient of a hypothetical offspring of x and y. If individuals A and B are
the parents of x, and C and D are the parents of y, then the r and u coefficients
in Equation 9.36 are:

$$r = 2F_{xy}$$
$$u = F_{AC}F_{BD} + F_{AD}F_{BC}$$

It is an instructive exercise to write out pedigrees for some of the relations in
Table 9.7, calculate the coefficients of coancestry, and verify that Equation
9.36 gives the correct result.

TABLE 9.7  THEORETICAL COVARIANCE IN PHENOTYPE BETWEEN RELATIVES^a

Degree of relationship                          Covariance
Offspring and one parent                        $\sigma_a^2/2$
Offspring and average of parents (midparent)    $\sigma_a^2/2$
Half siblings                                   $\sigma_a^2/4$
Full siblings                                   $(\sigma_a^2/2) + (\sigma_d^2/4)$
Monozygotic twins                               $\sigma_a^2 + \sigma_d^2$
Nephew and uncle                                $\sigma_a^2/4$
First cousins^b                                 $\sigma_a^2/8$
Double first cousins                            $(\sigma_a^2/4) + (\sigma_d^2/16)$

^a Variance terms due to interaction between loci (epistasis) have been ignored.
^b First cousins are the offspring of matings between siblings and unrelated individuals; double
first cousins are the offspring of matings between siblings from two different families.
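
The exercise suggested above can be sketched in code for a few of the relationships in Table 9.7. The Python example below is an illustration under assumptions that go slightly beyond the text: it takes $r = 2F_{xy}$, computes $F_{xy}$ as the average of the four coancestries between the parents of x and the parents of y (the usual recursion for non-inbred pedigrees), and uses a coancestry of 1/2 for a non-inbred individual with itself and 1/4 for non-inbred full sibs. The function and constant names are ours.

```python
from fractions import Fraction


def r_and_u(F_AC, F_AD, F_BC, F_BD):
    """Coefficients of Equation 9.36 from the coancestries of the four parents.

    A and B are the parents of x; C and D are the parents of y.
    Assumes F_xy is the average of the four parental coancestries
    (an assumption of this sketch, valid for non-inbred pedigrees).
    """
    F_xy = (F_AC + F_AD + F_BC + F_BD) / 4
    r = 2 * F_xy
    u = F_AC * F_BD + F_AD * F_BC
    return r, u


SELF = Fraction(1, 2)   # coancestry of a non-inbred individual with itself
SIBS = Fraction(1, 4)   # coancestry of non-inbred full sibs
NONE = Fraction(0)      # unrelated individuals

relationships = {
    # Full sibs share both parents: C is A and D is B.
    "full siblings": r_and_u(SELF, NONE, NONE, SELF),
    # Half sibs share one parent: C is A; B and D are unrelated.
    "half siblings": r_and_u(SELF, NONE, NONE, NONE),
    # Double first cousins: A and C are sibs, and B and D are sibs.
    "double first cousins": r_and_u(SIBS, NONE, NONE, SIBS),
}

for name, (r, u) in relationships.items():
    print(f"{name}: Cov = ({r}) sigma_a^2 + ({u}) sigma_d^2")
```

The three pairs of coefficients reproduce the corresponding rows of Table 9.7, including the dominance term that appears only when both pairs of parents are related.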

Figure 9.17 presents the narrow-sense heritabilities of diverse quantitative
traits in farm animals and one important crop plant as estimated from the
correlation between relatives. The data are presented merely to show the values
of heritability with which breeders typically must deal. It is important to keep
in mind that the heritabilities in Figure 9.17 pertain to one population in one
type of environment at one particular time. The same trait in a different
population or in a different environment might well have a different heritability.
Generally speaking, traits that are closely related to fitness (such as calving
interval in cattle or eggs per hen in poultry) tend to have rather low
heritabilities. Ignoring complications such as antagonistic pleiotropy (discussed later),
long-term natural selection is expected to gradually reduce the additive
genetic variance until the effect is balanced against the input of new mutations.

For purposes of comparison, Figure 9.18 shows estimated broad-sense
heritabilities of a number of quantitative traits in humans. Broad-sense
heritabilities vary widely for different traits, as they do in other species. Note the
low heritability of fertility, a trait that is obviously closely related to fitness.
At the other end of the scale is total fingerprint ridge count, which is
apparently not a major component of fitness considering its relatively high
broad-sense heritability.

Although  it  is  tempting  to  think  about  resemblance  between  relatives  in 
terms  of  classical  genetic  analysis  such  as  Mendel  did,  there  are  major  differ- 
ences between  the  approaches.  When  the  data  are  measurements  of  a contin- 
uous character  in  family-structured  samples  from  a population,  estimates  of 
statistical  components  of  variance  can  be  obtained.  However,  these  compo- 
nents depend  on  allele  frequencies  and  environmental  conditions,  and  thus 
quantities  such  as  heritability  are  far  removed  from  the  basic  level  of  gene 
action.  Direct  experimental  assessment  of  variance  components  (e.g., 
Mitchell-Olds 1986) shows that different populations do have different
heritabilities of many traits. Moreover, high heritability does not mean
that a trait cannot be affected by the environment. For example,
phenylketonuria,  which  is  caused  by  homozygosity  for  a single  defective 
allele  for  the  enzyme  phenylalanine  hydroxylase,  is  a simple  Mendelian 
recessive,  yet  the  phenotype  of  severe  mental  retardation  can  be  completely 
circumvented  by  a diet  low  in  phenylalanine. 

Figure 9.17 Narrow-sense heritabilities for representative traits in plants and
animals.  Traits  closely  related  to  fitness  (calving  interval,  eggs  per  hen,  litter 
size  of  swine,  yield  and  ear  number  of  corn)  tend  to  have  rather  low  heritabili- 
ties. (Animal  data  from  Pirchner  1969,  who  gives  the  range  of  heritabilities  in 
various  studies.  The  midpoint  of  the  range  is  plotted  here.  Corn  data  from 
Robinson  et  al.  1949.) 


PROBLEM  9.6  Consider  the  following  hypothetical  experiment  on 
plant  growth.  Seeds  were  removed  from  six  plants  and  offspring  were 
grown  in  two  different  light  conditions.  Height  of  the  plant  at  eight 
weeks  was  measured  in  the  parental  plants  and  all  the  progeny, 
giving  the  following  data: 

Midparent    Offspring with    Offspring with
             full sunlight     10% full sunlight

1.10         1.12              0.63
1.54         1.53              1.02
1.23         1.22              0.74
1.06         1.04              0.53
1.47         1.43              0.91
1.38         1.32              0.83

Calculate  the  heritability  in  both  environments. 


ANSWER The variance in the midparents can be calculated as
$\{\sum x_i^2 - [(\sum x_i)^2/6]\}/5 = 0.0391$. The covariance between midparents and
offspring at full sunlight is $\{\sum x_i y_i - [(\sum x_i \sum y_i)/6]\}/5 = 0.0365$, so the
narrow-sense heritability is 0.0365/0.0391 = 0.93. At 10% full sunlight, the
midparent-offspring  covariance  is  0.0353,  and  the  heritability  is 
0.0353/0.0391  = 0.90.  This  example  illustrates  the  important  principle 
that  a trait  may  have  a very  high  heritability,  yet  the  phenotypic  mean 
is  still  strongly  altered  by  a change  in  the  environment.  Actually,  the 
heritability  itself  may  also  change  as  one  moves  from  one  environ- 
ment to  another. 
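
The arithmetic in this answer can be checked with a short script. What follows is only a minimal sketch: the function names are illustrative, and the estimator is simply the covariance of offspring with midparent divided by the midparent variance, both with the n - 1 divisor used above.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor, as in the worked answer."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def midparent_heritability(midparents, offspring):
    """Slope of the offspring-on-midparent regression: cov(x, y) / var(x)."""
    return covariance(midparents, offspring) / covariance(midparents, midparents)

# Data from Problem 9.6
midparent = [1.10, 1.54, 1.23, 1.06, 1.47, 1.38]
full_sun = [1.12, 1.53, 1.22, 1.04, 1.43, 1.32]
low_light = [0.63, 1.02, 0.74, 0.53, 0.91, 0.83]

h2_full = midparent_heritability(midparent, full_sun)
h2_low = midparent_heritability(midparent, low_light)

print(f"full sunlight: mean = {mean(full_sun):.2f}, h2 = {h2_full:.2f}")
print(f"10% sunlight:  mean = {mean(low_light):.2f}, h2 = {h2_low:.2f}")
```

The two heritabilities are nearly the same even though the offspring means differ by about half a unit of height, which is the point of the problem.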


Figure 9.18 Broad-sense heritabilities and ranges of heritabilities of various
traits in humans. Uncertainties about the correlation between environments of
relatives make such estimates in humans very tentative. (Data from Smith 1975.)

The distinction between estimation of variance components and knowledge
of genetic causes of human differences has particularly important implications
for social applicability of human quantitative genetics. Some of the
problems  are  forcefully  conveyed  by  Lewontin  (1974)  and  Feldman  and 
Lewontin  (1975).  In  particular,  estimates  of  heritability  within  a population, 
even  if  they  are  sound,  tell  us  nothing  about  the  degree  to  which  genetic  dif- 
ferences account  for  the  differences  in  phenotypes  between  populations  (see 
Problem  9.6).  Experimentalists  working  with  organisms  that  can  be  manipu- 
lated in  the  field  or  laboratory  can  be  more  rigorous  in  the  assessment  of 
genetic  parameters  by  examining  the  traits  in  several  environments,  and  by 
doing  studies  that  combine  classical  and  quantitative  genetic  analysis. 

Twin  Studies  and  Inferences  of  Heritability  in  Humans 

Because  identical  twins  are  genetically  identical,  phenotypic  differences 
between  identical  twins  would  seem  to  be  a straightforward  measure  of  how 
much  phenotypic  variance  is  caused  by  environment.  Twin  studies  raise 
their  own  unique  problems,  however,  and  the  results  must  be  interpreted 
with  caution.  Before  discussing  the  use  of  twins  in  quantitative  genetics,  we 
should  back  up  a few  steps  and  first  discuss  the  phenomenon  of  twinning 
itself. 

Twins  are  relatively  frequent  among  human  births,  though  the  rate  of 
twinning  varies  from  population  to  population.  Among  Caucasians  in  the 
United  States,  for  example,  about  one  in  88  births  results  in  twins;  among 
Japanese  in  Japan,  the  rate  is  about  one  in  145  births  (Bulmer  1970).  Two 
kinds  of  twins  actually  occur.  Identical  twins,  often  called  monozygotic  or 
one-egg  twins,  arise  from  a single  zygote  that  very  early  in  embryonic  devel- 
opment splits  into  two  distinct  clumps  of  cells,  each  clump  thereafter  under- 
going its  own  embryonic  development.  Because  they  arise  from  a single 
zygote,  identical  twins  are  necessarily  genetically  identical.  The  other  kind  of 
twins  are  called  fraternal  twins,  dizygotic  twins,  or  two-egg  twins.  Fraternal 
twins  arise  from  a double  ovulation  in  the  mother,  each  egg  being  fertilized 
by  a different  sperm.  Because  of  their  mode  of  origin,  fraternal  twins  are 
related  genetically  as  siblings.  Most  of  the  variation  in  twinning  rates  in 
humans  is  due  to  variation  in  the  rate  of  dizygotic  twinning.  For  example,  the 
rates  of  monozygotic  twinning  among  Caucasians  in  the  United  States  and 
among  Japanese  in  Japan  are  one  in  256  and  one  in  238,  respectively,  whereas 
the  respective  rates  of  dizygotic  twinning  are  one  in  135  and  one  in  370  (Bul- 
mer 1970). 

For  studies  in  quantitative  genetics,  identical  twins  are  often  compared 
with  same-sex  fraternal  twins  in  order  to  discount  the  effects  of  common 
intrauterine  environments.  Such  an  approach  is  only  partially  successful,  as 
identical  twins  often  share  embryonic  membranes  in  utero  (the  amnion  and 
chorion)  that  are  not  usually  shared  by  fraternal  twins.  Moreover,  because 
identical  twins  often  have  astonishingly  similar  facial  features,  they  may  be 
treated more similarly by parents, teachers, and peers than are fraternal
twins.  Some  of  these  problems  can  be  overcome  by  studying  twins  that  are 
raised  apart  (in  different  households),  but  data  of  this  sort  are  usually  limited 
(Shields  1962),  making  estimates  of  heritability  highly  imprecise.  Even  when 
twins  are  reared  apart,  the  environments  into  which  they  are  adopted  are 
generally similar. Such correlated environments inflate the apparent degree
to which traits are genetically determined. In
any case, if $r_{MZ}$ and $r_{DZ}$ represent the correlation coefficients of a quantitative
trait among monozygotic and dizygotic twins, then $2(r_{MZ} - r_{DZ})$ provides a
rough estimate of the broad-sense heritability of the trait. To see where this
formula comes from, first look at Table 9.7. The covariance of monozygotic
twins in the absence of environmental correlation is $\sigma_a^2 + \sigma_d^2$, so the
correlation between monozygotic twins is this covariance divided by the
phenotypic variance, or the broad-sense heritability. If monozygotic and dizygotic
twins have the same degree of environmental correlation, then subtracting
the correlation of one from the other should remove the environmental
correlation. The covariance between dizygotic twins is the same as that of full
sibs, or $\tfrac{1}{2}\sigma_a^2 + \tfrac{1}{4}\sigma_d^2$. Assuming that the phenotypic variance $\sigma_p^2$ is the same in
both types of twins, the expression $2(r_{MZ} - r_{DZ})$ is equal to $[\sigma_a^2 + \tfrac{3}{2}\sigma_d^2]/\sigma_p^2$,
which is not exactly equal to the broad-sense heritability, but it is an
approximation (Smith 1975). Even when the mathematically precise estimators are
used,  the  problem  of  shared  environments  does  not  go  away. 
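
The size of the discrepancy between the twin estimate and the broad-sense heritability can be illustrated numerically. The sketch below assumes no environmental correlation between twins, and the variance components passed in are hypothetical values chosen only for illustration; the function names are not from any library.

```python
def twin_estimate(r_mz, r_dz):
    """Rough broad-sense heritability from twin correlations: 2(r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)

def expected_twin_estimate(var_a, var_d, var_e):
    """Expected twin estimate and true broad-sense heritability.

    Assumes twin covariances of var_a + var_d (monozygotic) and
    var_a/2 + var_d/4 (dizygotic), with no shared-environment term.
    """
    var_p = var_a + var_d + var_e
    r_mz = (var_a + var_d) / var_p
    r_dz = (0.5 * var_a + 0.25 * var_d) / var_p
    broad_sense = (var_a + var_d) / var_p
    return twin_estimate(r_mz, r_dz), broad_sense

# Hypothetical components: additive 4, dominance 2, environmental 4
estimate, broad_sense = expected_twin_estimate(var_a=4.0, var_d=2.0, var_e=4.0)
print(f"twin estimate = {estimate:.2f}, broad-sense heritability = {broad_sense:.2f}")
```

With these values the twin estimate exceeds the broad-sense heritability by half the dominance variance divided by the phenotypic variance; the two agree only when there is no dominance variance.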

One  human  trait  that  has  received  an  inordinate  amount  of  attention  is 
intelligence.  The  estimation  of  heritability  of  intelligence  from  twin  studies  is 
steeped  in  controversy.  Aside  from  the  necessity  to  define  intelligence  as  per- 
formance on  an  IQ  test,  these  studies  face  enormous  hurdles  in  obtaining 
accurate  assessments  of  causes  of  patterns  of  similarity.  Ever  since  the  data  of 
Cyril  Burt  showing  a heritability  of  IQ  of  0.771  were  cast  in  doubt,  there  have 
been  efforts  to  revive  claims  of  a very  high  heritability  of  IQ.  Often  the  lan- 
guage is  imprecise,  with  claims  like  "about  70%  of  the  variance  in  IQ  was 
found  to  be  associated  with  genetic  variation"  (Bouchard  et  al.  1990).  Even 
with  the  best  care  taken  to  study  only  adopted  twins  reared  apart,  the  prob- 
lem of  correlated  environments  makes  it  impossible  to  obtain  an  entirely  reli- 
able estimate.  An  important  point  that  is  frequently  overlooked  in  this 
discussion  is  that  a high  heritability  implies  nothing  about  the  ability  to 
change  a trait  by  modification  of  the  environment.  The  question  has  been 
raised  whether  there  is  any  societal  good  to  be  gained  from  knowledge  of 
heritability  of  IQ;  the  problem  makes  for  lively  reading  on  both  sides  (Lewon- 
tin  et  al.  1984;  Herrnstein  and  Murray  1996).  Fortunately  for  evolutionary 
biologists,  organisms  that  can  be  reared  in  controlled  conditions  do  afford  the 
opportunity  to  obtain  meaningful  estimates  of  heritability  and  other  compo- 
nents of  quantitative  genetic  variation,  as  described  in  the  next  section. 

EXPERIMENTAL ASSESSMENT OF
GENETIC  VARIANCE  COMPONENTS 

Population  biologists  typically  estimate  the  heritability  and  genetic  correla- 
tions of  a trait  for  the  purpose  of  addressing  the  genetic  constraints  on  the 
evolution  of  the  trait.  The  best  experimental  approach  depends  on  whether 
the  organisms  are  sampled  directly  from  a natural  population,  whether  a 
series  of  inbred  lines  is  available,  and  whether  laboratory  rearing  is  practical. 
Because  the  components  of  variance  (including  heritability)  are  descriptions 
of  a particular  population  in  a particular  environment,  the  ideal  would 
appear  to  be  to  use  naturally  occurring  individuals,  keeping  track  of  their 
familial  relationships,  and  to  fit  the  statistical  models  relating  degrees  of 
relationship  to  expected  covariances.  In  practice,  because  the  natural  envi- 
ronment is  so  variable,  it  is  often  preferable  to  do  analyses  in  a controlled  lab- 
oratory environment,  but  this  introduces  other  problems  outlined  below. 

Estimation  of  genetic  variance  components  in  natural  populations  gener- 
ally requires  a number  of  restrictive  assumptions:  (1)  diallelic  inheritance,  (2) 
no  correlation  of  parental  and  offspring  environments,  (3)  no  linkage  or  link- 
age disequilibrium,  (4)  parents  equally  inbred,  (5)  offspring  not  inbred,  (6) 
samples  of  relatives  drawn  at  random  from  a noninbred  population,  (7)  no 
mutation, migration, or selection, and (8) random mating (see Mitchell-Olds
and  Rutledge  1986  and  references  therein).  In  practice,  small  violations  of 
these  assumptions  are  acceptable,  but  it  is  nevertheless  a serious  challenge  to 
obtain  reliable  estimates  of  genetic  variance  components  from  samples  from 
natural  populations.  Most  commonly,  a full  analysis  is  not  carried  out  in  nat- 
ural populations,  but  offspring  collected  from  known  mothers  are  studied. 
One  can  also  estimate  a lower  bound  on  the  heritability  in  a natural  popula- 
tion by  regression  of  measurements  of  laboratory-reared  offspring  on  the 
measurements  of  parents  sampled  from  nature  (Riska  et  al.  1989). 

Once  measurements  are  obtained  on  individuals  with  known  degrees  of 
relationship,  the  partitioning  of  variance  into  additive,  dominance,  and  envi- 
ronmental components  can  be  done  using  the  standard  statistical  method  of 
analysis  of  variance.  Many  experimental  designs  permit  the  effects  of  various 
factors  on  quantitative  genetic  components  to  be  estimated.  For  example, 
assays  of  variance  components  could  be  repeated  under  several  different 
environmental  regimes,  and  the  environmental  component  in  the  analysis  of 
variance  could  be  estimated.  A second  generation  of  organisms  could  also  be 
studied  and  the  variance  components  estimated  from  parent-offspring 
covariance.  However,  the  two  most  common  designs  are  analysis  of  variance 
of  full-sib  families  or  half-sib  families.  Using  full-sib  families  alone  has  the 
problem  that  the  covariance  of  full  sibs  includes  both  additive  and  domi- 
nance variance.  Therefore,  when  the  data  are  limited  to  full-sib  families,  all 
that  can  be  estimated  is  the  broad-sense  heritability.  On  the  other  hand,  if  one 
studies only half-sibs (whose covariance is $\tfrac{1}{4}\sigma_a^2$), one can estimate the
narrow-sense heritability, but the dominance component may be quite large and
remain  undetected. 
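
As an illustration of the half-sib design, a one-way analysis of variance can be sketched as follows. This is a sketch under stated assumptions, not a full analysis: the design is balanced, each family is taken to be the offspring of one sire mated to unrelated dams (so that the between-family variance component estimates one-quarter of the additive variance), and the data are hypothetical.

```python
from statistics import mean

def half_sib_heritability(families):
    """Narrow-sense heritability from a balanced half-sib design.

    families: one list of offspring phenotypes per sire (hypothetical layout).
    The sire variance component estimates (1/4) * additive variance, so
    h2 = 4 * var_sire / (var_sire + var_within).
    """
    s = len(families)            # number of sires
    k = len(families[0])         # offspring per sire
    if any(len(f) != k for f in families):
        raise ValueError("this sketch handles balanced designs only")
    grand = mean(v for f in families for v in f)
    family_means = [mean(f) for f in families]
    ss_between = k * sum((m - grand) ** 2 for m in family_means)
    ss_within = sum((v - m) ** 2 for f, m in zip(families, family_means) for v in f)
    ms_between = ss_between / (s - 1)
    ms_within = ss_within / (s * (k - 1))
    var_sire = max((ms_between - ms_within) / k, 0.0)  # truncate negative estimates
    return 4.0 * var_sire / (var_sire + ms_within)

# Hypothetical phenotypes for three sires with four offspring each
families = [[11, 12, 14, 15], [12, 13, 15, 16], [13, 14, 16, 17]]
h2 = half_sib_heritability(families)
print(f"h2 = {h2:.3f}")
```

Nothing in this calculation reveals the dominance variance, which is why the half-sib design alone can leave a large dominance component undetected.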

More  elaborate  designs  that  feature  parents,  full  sibs  and  half  sibs  can 
give  extensive  partitioning  of  variance  components.  The  reliability  of  the 
methods can be tested by comparing heritability estimates from
parent-offspring regression with those from the covariance of half sibs. One means of
using  all  the  data  in  a single  estimate  is  the  method  of  maximum  likelihood, 
a procedure  for  parameter  estimation  which  solves  for  parameter  values  that 
have  the  maximum  likelihood  of  obtaining  the  observed  data  under  a given 
model.  Maximum  likelihood  methods  are  preferable  to  analysis  of  variance 
when  several  traits  are  being  examined  to  estimate  the  components  of  genet- 
ic covariance.  Analysis  of  variance  becomes  extremely  cumbersome  when 
sample  sizes  are  not  the  same  at  all  levels  of  relationship,  and  there  is  no 
single  ideal  way  to  adjust  for  unequal  sample  sizes. 

The  principle  behind  maximum  likelihood  is  to  construct  a likelihood 
function  that  describes  the  likelihood  of  obtaining  the  observed  data  given 
the  family  structure  and  a set  of  unknown  parameters  to  be  estimated.  The 
unknown  parameters  are  the  magnitudes  of  the  various  genetic  and  environ- 
mental variance  components.  The  method  then  finds  the  values  of  the 
unknown  parameters  that  maximize  the  likelihood.  In  practice,  the  unknown 
parameters  form  a variance-covariance  matrix,  and  the  computer  algorithms 
entail  extensive  matrix  manipulations.  The  computer  output  consists  of  esti- 
mates of  heritability  that  utilize  all  the  data,  including  parent-offspring,  full- 
sib,  half-sib,  and  any  other  relationships  that  are  informative.  Although  no 
assumption  of  multivariate  normality  of  the  character  data  is  necessary  in 
obtaining  the  estimates,  the  estimates  have  meaning  (and  the  prediction 
equation  is  reliable)  only  if  the  phenotypes  are  normally  distributed.  Testing 
the  statistical  significance  of  the  estimates  requires  normality.  Shaw  (1987) 
provides  a thorough  review  of  the  merits  of  maximum  likelihood  methods  in 
quantitative  genetics. 
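
The idea can be made concrete in the simplest possible case. The sketch below is not the general pedigree method described above: it assumes balanced families of equal size, normally distributed phenotypes, and a model with only a between-family and a within-family variance component, for which the maximum of the likelihood happens to have a closed form. The data and all function names are hypothetical.

```python
from math import log
from statistics import mean

def sums_of_squares(families):
    """Between-family and within-family sums of squares (balanced data)."""
    n = len(families[0])
    grand = mean(v for f in families for v in f)
    family_means = [mean(f) for f in families]
    ss_between = n * sum((m - grand) ** 2 for m in family_means)
    ss_within = sum((v - m) ** 2 for f, m in zip(families, family_means) for v in f)
    return ss_between, ss_within

def log_likelihood(families, var_between, var_within):
    """Normal log-likelihood (additive constant dropped), mean set to the grand mean."""
    a, n = len(families), len(families[0])
    ss_between, ss_within = sums_of_squares(families)
    lam = var_within + n * var_between   # variance scale of family means times n
    return -0.5 * (a * (n - 1) * log(var_within) + a * log(lam)
                   + ss_within / var_within + ss_between / lam)

def ml_estimates(families):
    """Values of (var_between, var_within) that maximize log_likelihood."""
    a, n = len(families), len(families[0])
    ss_between, ss_within = sums_of_squares(families)
    var_within = ss_within / (a * (n - 1))
    var_between = (ss_between / a - var_within) / n
    if var_between < 0.0:                # boundary solution: no family effect
        return 0.0, (ss_between + ss_within) / (a * n)
    return var_between, var_within

# Hypothetical data: three families of four sibs
families = [[10, 11, 13, 14], [12, 13, 15, 16], [14, 15, 17, 18]]
var_between, var_within = ml_estimates(families)
print(f"between = {var_between:.3f}, within = {var_within:.3f}")
print(f"log-likelihood at maximum = {log_likelihood(families, var_between, var_within):.3f}")
```

For full-sib families the between-family component estimates the full-sib covariance, and for half-sib families the half-sib covariance. With unbalanced data or several kinds of relatives no closed form exists, and the likelihood must be maximized numerically.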

Heritability  tells  us  virtually  nothing  about  the  actual  mode  of  inheritance 
of  a quantitative  trait,  useful  as  the  concept  may  be  in  predicting  response  to 
selection.  The  heritability  of  a trait  represents  the  cumulative  effect  of  all 
genes  that  affect  the  trait.  Even  if  a trait  is  determined  by  a single  gene,  heri- 
tability depends  in  a complex  manner  on  the  values  of  p,  a,  and  d,  and  these 
individual  components  cannot  be  disentangled.  (The  values  of  p,  a,  and  d are 
said  to  be  statistically  confounded.)  With  more  than  one  gene,  the  heritabili- 
ty includes  a summation  of  terms  for  each  gene,  and  each  term  has  its  own 
particular  values  of  p,  q,  a,  and  d.  Here,  precisely,  is  the  problem:  for  a quanti- 
tative trait  determined  by,  say,  10  diallelic  loci,  there  would  be  30  quantities 
involved in heritability: 10 allele frequencies, 10 values of a, and 10 values of
d. Heritability is but a single number that gives the combined effect of all 30
quantities.  It  says  nothing  about  any  one  of  them. 

It  must  be  emphasized  that  heritability  is  a quantity  that  comes  from  a 
mathematical  model  of  reality,  and  the  model  has  many  assumptions.  We 
have  assumed  that  all  genes  affecting  the  trait  act  independently  of  one 
another  and  are  unlinked.  In  actual  cases,  genes  often  interact  and  can  be 
linked.  (The  model  can,  however,  be  extended  to  incorporate  at  least  partial- 
ly the  effects  of  linkage  and  epistasis.)  Moreover,  the  assumption  of  no  corre- 
lation of  parental  and  offspring  environments  is  not  always  easy  to  test.  All  in 
all,  while  heritability,  especially  realized  heritability,  is  an  indispensable  aid 
to  plant  and  animal  breeders,  it  lends  itself  to  no  easy  interpretation  in  sim- 
ple genetic  terms  (apart  from  the  statistical  description  through  parent- 
offspring regression).  Another  difficulty  in  interpreting  heritability  values  is 
that  they  depend  on  the  range  of  environments  that  occur.  The  denominator 
of $h^2$ in Equation 9.23 is the total variance in phenotypic value in the
population. Because the total variance includes the variance resulting from
environmental differences among individuals, increasing the variation in the
environment decreases $h^2$. Exceptionally thoughtful discussions of the
concept of heritability and its strengths and limitations are found in Kempthorne
(1978)  and  Jacquard  (1983). 

Heritability  values  are  determined  in  part  by  gene  frequencies.  Because 
gene  frequencies  change  during  the  course  of  selection,  the  heritability  is  also 
expected  to  change.  In  practice,  however,  the  heritability  changes  sufficiently 
slowly  that  over  the  course  of  a few  generations,  it  can  be  regarded  as  approx- 
imately constant.  The  approximate  constancy  of  heritability  has  a twofold 
cause:  (1)  if  a particular  gene  accounts  for  only  a small  proportion  of  the  total 
phenotypic  variance  in  a quantitative  trait,  then  the  gene  frequency  does  not 
change  very  rapidly,  and  (2)  the  values  of  a and  d remain  nearly  constant  pro- 
vided that  the  environment  does  not  change  drastically  from  one  generation 
to the next. Thus, at least for the first 10 generations or so, heritability
usually remains approximately constant and can be used as a constant in the
prediction equation (Equation 9.10). To be precise, suppose $h^2$ is constant and let
$\mu_t$ and $S_t$ represent the mean of the population and the selection differential
in the $t$th generation. Then, over the length of time during which $h^2$ is
approximately constant,

$\mu_t - \mu_0 = h^2(S_0 + S_1 + \cdots + S_{t-1})$    9.38

The quantity $\mu_t - \mu_0$ is the total response to selection, and $S_0 + S_1 + \cdots + S_{t-1}$
is called the cumulative selection differential. During the time in which $h^2$ is
approximately constant, therefore, a plot of $\mu_t$ against cumulative selection
differential is expected to yield a straight line with slope equal to $h^2$, as
illustrated for a case in mice in Figure 9.19.
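
Equation 9.38 can be turned around to estimate the heritability realized in a selection experiment. The sketch below uses hypothetical generation means and selection differentials, and it fits the slope by least squares through the origin, which is one reasonable choice among several; the function names are illustrative.

```python
from itertools import accumulate

def predicted_means(mu_0, h2, differentials):
    """Equation 9.38: mu_t = mu_0 + h2 * (S_0 + ... + S_{t-1}) for t = 0, 1, 2, ..."""
    return [mu_0] + [mu_0 + h2 * c for c in accumulate(differentials)]

def realized_heritability(means, differentials):
    """Slope of total response on cumulative selection differential.

    means[0] is the starting mean; means[t] is the mean after t generations
    of selection with differentials S_0 ... S_{t-1}.
    """
    cumulative = list(accumulate(differentials))
    response = [m - means[0] for m in means[1:]]
    if len(response) != len(cumulative):
        raise ValueError("need one selection differential per generation of response")
    return (sum(c * r for c, r in zip(cumulative, response))
            / sum(c * c for c in cumulative))

# Hypothetical experiment: four generations of selection
observed_means = [22.0, 22.7, 23.3, 24.1, 24.8]
differentials = [2.0, 2.0, 2.0, 2.0]
h2_realized = realized_heritability(observed_means, differentials)
print(f"realized heritability = {h2_realized:.3f}")
```

Departure of the plotted points from a straight line, as at the later generations in Figure 9.19, signals that the heritability is no longer approximately constant.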

Figure 9.19 Linearity in response against the cumulative selection differential
for  body  weight  in  mice  at  age  six  weeks.  Linearity  in  the  up  (high-weight) 
direction  continues  for  about  twice  as  long  as  it  does  in  the  down  (low-weight) 
direction.  (After  Falconer  1955.) 


Indirect  Estimation  of  the  Number  of  Genes  Affecting 
a Quantitative  Character 

The  number  of  genes  that  contribute  to  quantitative  traits  is  not  always  large. 
We  have  already  seen  an  example  of  seed  color  in  wheat  in  which  the  num- 
ber of  genes  was  three.  When  the  number  of  genes  is  relatively  small,  the 
number  can  often  be  estimated  from  the  means  and  variances  observed  in 
different strains and their hybrids and backcrosses. In the case of two
additive alleles of each of three genes, when parental strains are homozygous for
all unfavorable or all favorable alleles, then they differ by six units in
phenotypic value. The variance in phenotypic value in the F2 generation equals 3/2
units². If there are n unlinked, additive genes, the difference D in
phenotypic value between the means of parental inbred lines is 2n units, and the
variance $\sigma_g^2$ in the F2 generation is n/2 units².

In  order  to  obtain  an  estimate  of  the  number  of  genes  that  is  independent 
of  the  units  of  measurement,  a ratio  is  needed  to  make  the  units  cancel.  One 
possibility,  first  suggested  by  Wright  (in  Castle  1921),  is 

$\hat{n} = \frac{D^2}{8\sigma_g^2}$    9.39

With n unlinked, additive genes, $\hat{n} = (2n)^2/[8(n/2)] = n$, as it should.
Equation  9.39  is  based  on  the  assumptions  of  complete  additivity,  equal 
effects  of  all  genes,  no  linkage,  and  fixed  differences  between  parental  lines. 
When the assumptions are violated, application of the equation usually results
in  estimates  of  gene  number  that  are  smaller  than  the  actual  numbers.  For  this 
reason,  the  quantity  estimated  in  Equation  9.39  is  called  the  effective  number 
of  genes  because  it  defines  a lower  limit  to  the  actual  number.  Figure  9.20  pre- 
sents the  results  of  a simulation,  using  a range  of  2 to  10  genes  to  generate  sam- 
ple data  to  which  Equation  9.39  was  applied  to  estimate  n.  The  message  is  that 
this method is very approximate and somewhat biased. The statistical
properties of the method have been improved (Zeng 1992), but the estimate is still
only  a rough  approximation  to  the  number  of  genes  affecting  a trait. 

The  variance  in  Equation  9.39  is  the  genetic  variance,  which  is  the  variance 
in  phenotypic  value  resulting  from  genetic  differences  among  individuals. 
When the environment contributes an amount $\sigma_e^2$ to the phenotypic variance,
then the variance within the parental inbred lines, or within the F1 population,
equals $\sigma_e^2$. This is because the populations are genetically uniform, and the
only source of variation in phenotype results from the environment. However,
within the F2 generation, variation in phenotype results in part from genetic
variation and in part from environmental variation, and the total variance in
phenotypic value equals the summation $\sigma_g^2 + \sigma_e^2$. Therefore, subtraction of the
F1 variance from the F2 variance gives an estimate of $\sigma_g^2$ because

$\sigma_g^2 = [\sigma_g^2 + \sigma_e^2] - \sigma_e^2$    9.40

Further  discussion  of  the  genetic  variance  occurs  later  in  this  chapter. 
Lande (1981) gives several alternative methods of estimating the genetic
variance using data from inbreds, hybrids, and backcrosses. Cockerham (1986)
extended the analysis to obtain an unbiased estimation of the difference in
parental means, and he combined the data from parentals, F1, F2, and
backcrosses into a single least-squares estimate.

Figure 9.20 Computer-generated samples of an F2 population having a range
of numbers of loci with purely additive effects that influence a trait. For each
sample, the number of genes was estimated following Wright's method.

The  scale  in  which  a phenotype  is  measured  is  important  in  using  Equa- 
tion 9.39  because  the  genes  and  alleles  are  assumed  to  be  additive.  In  the 
ideal  additive  case,  a plot  of  the  means  and  variances  of  inbreds,  hybrids,  and 
backcrosses  forms  a triangle  of  the  type  shown  in  Figure  9.21A.  In  actual 
cases, the phenotypes must be measured using a scale that yields an
approximate triangle. With fruit in tomatoes, for example, an obvious scale of
weight is in grams, but the scale giving the triangle in Figure 9.21B is
x = log[(weight in grams) - 0.153] (Lande 1981). Using this scale, D = 1.552
and $\sigma_g^2 = 0.0426$, from which Equation 9.39 implies $\hat{n} = (1.552)^2/[8(0.0426)] = 7$
as the effective number of genes.

Figure 9.21 (A) Expected triangular relation between means and variance of
phenotypic value among inbred parents, P, backcross progeny, B, and hybrids, F,
for an ideal quantitative trait determined by unlinked and completely additive
genes. (B) Observed relation for fruit weight in tomato, in which the fruit
weight is on a logarithmic scale. (After Lande 1981.)


PROBLEM  9.7  In  analyzing  data  on  oil  content  in  maize,  Lande 
(1981)  found  that  values  of  log[(percent  oil  in  kernels)  + 1.87]  approx- 
imated a triangular  form  like  those  in  Figure  9.21.  With  this  scale  of 
measurement,  the  means  of  two  inbred  lines  were  0.513  and  1.122, 
respectively, and the phenotypic variances of the inbred lines, the F1
generation,  and  the  F2  generation  were  0.00142,  0.00053,  0.00030,  and 
0.00303,  respectively.  Use  Equations  9.39  and  9.40  to  estimate  the 
effective  number  of  genes.  Estimate  the  environmental  variance  in 
two  different  ways:  (1)  as  the  mean  of  the  variance  in  phenotypic 
value in the parental inbred lines, (2) as equal to the variance in
phenotypic value in the F1 generation.

ANSWER Using the first estimate, $\sigma_e^2 = (0.00142 + 0.00053)/2 = 0.00098$,
$\sigma_g^2 = 0.00303 - 0.00098 = 0.00205$, D = 0.513 - 1.122 = -0.609,
and $\hat{n} = (-0.609)^2/[8(0.00205)] = 22.6$. Using the second estimate,
$\sigma_g^2 = 0.00303 - 0.00030 = 0.00273$, and $\hat{n} = 17.0$.
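
The calculation in Equations 9.39 and 9.40 is easy to script. The following is a minimal sketch with illustrative function names; it carries the environmental variance unrounded, so the first estimate differs slightly in the second decimal from the hand calculation above.

```python
def genetic_variance(var_f2, var_env):
    """Equation 9.40: genetic variance = F2 variance minus environmental variance."""
    var_g = var_f2 - var_env
    if var_g <= 0.0:
        raise ValueError("F2 variance must exceed the environmental variance")
    return var_g

def effective_gene_number(mean_p1, mean_p2, var_g):
    """Equation 9.39: D^2 / (8 * genetic variance), D = difference in parental means."""
    d = mean_p1 - mean_p2
    return d * d / (8.0 * var_g)

# Maize oil content, Problem 9.7
var_p1, var_p2, var_f1, var_f2 = 0.00142, 0.00053, 0.00030, 0.00303

# (1) environmental variance as the mean of the two parental variances
n_from_parents = effective_gene_number(
    0.513, 1.122, genetic_variance(var_f2, (var_p1 + var_p2) / 2))
# (2) environmental variance as the F1 variance
n_from_f1 = effective_gene_number(
    0.513, 1.122, genetic_variance(var_f2, var_f1))

print(f"parental estimate: {n_from_parents:.1f}")
print(f"F1 estimate:       {n_from_f1:.1f}")
```

The estimate is sensitive to the choice of environmental variance because the genetic variance appears in the denominator as a small difference between two variances.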


The effective numbers of genes for oil content in maize kernels calculated in
Problem 9.7 bracket those obtained by Lande (1981) using other estimates of
the environmental variance. Some of the estimates of number of genes
affecting quantitative traits are: tomato fruit weight, 7-11; maize kernel oil content,
17-22; Hawaiian Drosophila head size, 6-9; fish eye diameter, 5-7; and human
skin color, 4-6. With the notable exception of oil content in maize kernels, the
effective  number  of  genes  is  less  than  10.  These  examples  demonstrate  that 
genetic  variation  in  quantitative  traits  can  result  from  the  segregation  of  a rel- 
atively small  number  of  genes.  However,  the  effective  number  of  genes  repre- 
sents a minimum  estimate,  and  the  actual  number  of  genes  is  likely  to  be 
larger.  In  some  cases  the  estimated  number  of  genes  is  very  large.  For  exam- 
ple, at  least  40  genes  contribute  antigens  that  are  important  in  the  rejection  of 
skin  transplants  in  mice,  at  least  150  genes  contribute  to  body  weight  in  the 
mouse  and  Tribolium,  and  at  least  17  genes  on  the  third  chromosome  of 
Drosophila  melanogaster  influence  sternopleural  bristle  number  (Shrimpton  and 
Robertson  1988).  It  should  be  emphasized  that  the  number  of  genes  that  affect 
quantitative  traits  may  differ  according  to  the  precise  definition  of  "affect." 
For  example,  the  overall  long-term  selection  response  illustrated  in  Figure  9.5 
may  result  from  a large  number  of  genes,  but  most  of  the  response  that  occurs 
at  any  one  time  may  result  from  changes  in  allele  frequency  in  only  a few  of 
them.  As  we  will  see  in  the  last  section  of  this  chapter,  the  use  of  molecular 
markers  to  genetically  map  the  position  of  genes  affecting  quantitative  traits  is 
beginning  to  give  more  direct  assessments  of  the  number  of  genes  and  the  dis- 
tribution of  their  effects  on  quantitative  characters. 


NORM  OF  REACTION  AND  PHENOTYPIC  PLASTICITY 

In  considering  the  influence  of  environment  on  the  determination  of  pheno- 
types, it  is  tempting  to  think  of  environmental  effects  as  random  noise  added 
to  traits  that  are  basically  genetically  determined.  This  view  is  often  mis- 
leading. One  simple  way  to  analyze  the  effect  of  environment  on  phenotype 
is  to  examine  the  phenotypes  of  a single  genotype  in  an  array  of  environ- 
ments; it  is  possible  to  do  this  examination  in  a number  of  experimental 
organisms.  The  array  of  phenotypes  that  results  from  a given  genotype  is 
known  as  the  norm  of  reaction,  a term  coined  by  Schmalhausen  (1949). 
Figure 9.22 shows the norm of reaction of bristle number for 10 strains of
Drosophila  pseudoobscura  reared  at  three  temperatures  (Gupta  and  Lewontin 
1982).  The  rank  ordering  of  the  strains  was  found  to  change  at  different  tem- 
peratures, and  an  astounding  35  to  40%  of  pairwise  comparisons  of  strains 
reversed  in  ranking  when  one  temperature  was  compared  to  another.  Figure 
9.22  shows  a greater  variation  among  lines  at  14°  than  at  26°.  If  we  were  to 
do  a test  of  parent-offspring  regression  at  14°  and  compare  it  to  the  same  test 
at  26°,  the  former  would  have  a higher  heritability.  This  contrast  underscores 
the  fact  that  heritability  is  a measure  defined  in  one  environment,  and  it 
shows  that  the  methods  of  quantitative  genetics  using  analysis  of  variance 
cannot  separate  the  causes  of  phenotypic  differences.  In  order  to  detect  envi- 
ronmental effects,  the  norm  of  reaction  must  be  ascertained  by  examining 
phenotypes  in  a variety  of  environments. 

Referring  again  to  the  one-gene  model  for  quantitative  traits,  the  environ- 
ment can  affect  the  value  of  heritability  because  the  values  of  a and  d depend 
on the environment. Figure 9.13 may be used as an example; it shows the
norms of reaction of AA, AA', and A'A' genotypes. If we were dealing with
the range of environments denoted $E_1$ in Figure 9.13, A would be the favored
allele and A would be nearly dominant to A'. On the other hand, if we were
dealing with the range of environments denoted $E_2$, A' would be the favored
allele and there would be essentially no dominance. Thus, switching a
population from $E_1$ to $E_2$ would change the values of a and d and substantially alter
the  heritability  of  the  trait,  even  though  the  total  phenotypic  variance  of  the 
population  might  remain  the  same. 


Figure 9.22 Mean number of sternital bristles for a set of genotypes (lines)
of Drosophila reared at different temperatures. Differences in the norm of
reaction are apparent from the crossing lines in the plot.


PROBLEM 9.8 Dobzhansky and Spassky (1944) studied the norms
of reaction with respect to temperature of the viability of various
genotypes containing chromosomes extracted from natural populations
of Drosophila pseudoobscura. For two such chromosomes (designated
A and B), the following relative viabilities were obtained.

                           Genotype
Temperature (°C)      A/A     A/B     B/B

16.5                  0.92    1.00    0.71

25.5                  0.32    1.00    0.75

Estimate $\mu^*$, a, and d for these genotypes in populations maintained
at 16.5°C and 25.5°C. Then, letting p represent the frequency of
the A chromosome and q that of the B chromosome, assume p = 0.3,
and estimate the additive genetic variance of viability resulting from
these genotypes at both temperatures.


ANSWER At 16.5°C, $\mu^*$ = (0.92 + 0.71)/2 = 0.815, a = 0.92 - 0.815 =
0.105, d = 1.00 - 0.815 = 0.185. At 25.5°C, $\mu^*$ = 0.535, a = -0.215, d = 0.465.
Additive genetic variance for these genotypes is given by Equation
9.24, where p = 0.3 and q = 0.7. At 16.5°C, $\sigma_a^2$ = 0.01346. At 25.5°C,
$\sigma_a^2$ = 0.00035. Note that the additive genetic variance has been
decreased by a factor of almost 40, yet all we did was raise the
temperature!
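
The arithmetic in this answer can be checked with a short script. The sketch below assumes that the equation cited for the additive genetic variance has the standard one-gene form 2pq[a + (q - p)d]^2; that form reproduces the printed values, but it is an assumption here. The function names are illustrative and not from the text.

```python
def one_gene_parameters(v_aa, v_ab, v_bb):
    """Return (midpoint, a, d) from the values of the A/A, A/B, B/B genotypes."""
    midpoint = (v_aa + v_bb) / 2   # midpoint between the two homozygotes
    a = v_aa - midpoint            # half the difference between homozygotes
    d = v_ab - midpoint            # deviation of the heterozygote from the midpoint
    return midpoint, a, d


def additive_variance(p, a, d):
    """Additive genetic variance, assuming the form 2pq[a + (q - p)d]^2."""
    q = 1 - p
    alpha = a + (q - p) * d        # average effect of an allele substitution
    return 2 * p * q * alpha ** 2


# Relative viabilities (A/A, A/B, B/B) at each temperature, from the problem.
viabilities = {16.5: (0.92, 1.00, 0.71), 25.5: (0.32, 1.00, 0.75)}

results = {}
for temperature, values in viabilities.items():
    midpoint, a, d = one_gene_parameters(*values)
    variance = additive_variance(0.3, a, d)
    results[temperature] = (midpoint, a, d, variance)
    print(f"{temperature} C: midpoint={midpoint:.3f} a={a:.3f} d={d:.3f} "
          f"additive variance={variance:.5f}")
```

The ratio of the two variances comes out near 38, in line with the "factor of almost 40" quoted in the answer.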


The  norm  of  reaction  is  important  in  evolutionary  genetics  because  the 
fate  of  genetic  variation  in  a population  depends  on  the  fitness  of  the  organ- 
ism, which  in  turn  depends  on  the  environment.  In  turn,  the  norm  of  reaction 
is  itself  a property  that  may  be  under  genetic  control  and  be  subject  to  adap- 
tive evolution  by  means  of  natural  selection  (Schlichting  and  Pigliucci  1994; 
Via  et  al.  1995).  In  the  context  of  adaptive  norms  of  reaction,  evolutionary 
geneticists  often  apply  the  term  phenotypic  plasticity.  It  is  interesting  to  con- 
sider whether  natural  selection  modifies  phenotypic  plasticity  as  part  of 
adaptation,  especially  since  extreme  phenotypic  plasticity  is  not  always 
favorable.  For  example,  if  a plant  germinates  from  a seed  in  an  exceptionally 
dry  year  because  its  genotype  has  a high  level  of  phenotypic  plasticity  result- 
ing in  the  ability  to  cope  very  well  with  dry  periods,  then  in  the  following 
season the plant may have insufficient leaf area to compete with its less plastic
neighbors that are phenotypically better suited for average moisture. On
the other hand, it is also easy to describe a scenario in which the absence of
plasticity  can  prove  fatal.  A useful  way  to  model  the  evolution  of  adaptive 
phenotypic  plasticity  is  to  consider  the  phenotype  in  different  environments 
as  different  traits  that  may  be  genetically  correlated  (Falconer  and  Mackay 
1996;  Via  and  Lande  1985).  Further  discussion  of  this  model  will  be  deferred 
until  after  we  have  discussed  selection  of  more  than  one  character. 

One  biological  factor  that  affects  the  correspondence  between  genotypes 
and  phenotypes  is  the  chronological  age  of  an  organism.  A phenotype,  such 
as  body  weight,  often  changes  with  age,  and  different  genotypes  can  have 
different  age-related  growth  curves  (or  other  developmental  profiles).  Conse- 
quently, the  heritability  of  traits  depends  on  the  age  at  which  individuals  are 
tested.  Figure  9.23  shows  components  of  variance  of  mouse  body  weight  at 
different  ages  (Riska  et  al.  1984).  In  this  study,  2700  mice  from  700  full-sib 
families  were  measured  from  days  14  to  70.  Variance  of  all  components  was 
maximal at an age of about 20 days (which happens to be the time of maximal
growth), and then declined to fairly stable values at age 40 days. Despite
the additive genetic variance decreasing markedly after 20 days of age, the
heritability itself actually increased with age, with values of 0.22, 0.27, 0.31,
and 0.37 at days 12, 20, 50, and 70. Although environmental variance is
assumed to increase with age for many traits in humans, there is no marked
tendency for heritability to decrease with age.


Figure 9.23 Variance components for the logarithm of body weight in a
random-bred (genetically variable) strain of mice plotted as a function of
age. $\sigma_m^2$ is the variance due to maternal effects, and o;.r - o; - o}„ . (From
Atchley 1984.)


THRESHOLD  TRAITS  AND  THE  GENETICS  OF  LIABILITY 

Some  multifactorial  traits  do  not  exhibit  continuous  variation.  Although  the 
variation  is  discontinuous  (individuals  either  express  the  trait  or  not),  the 
trait  is  nevertheless  influenced  by  multiple  genetic  factors  and  also  by  envi- 
ronment. Such  traits  are  called  threshold  traits.  A human  example  is  dia- 
betes, an  abnormality  in  sugar  metabolism  that  affects  one  or  two  percent  of 
the  Caucasian  population.  In  a sense,  diabetes  is  a continuous  trait  because 
the  severity  of  the  disease  varies  from  nearly  undetectable  to  extremely 
severe.  On  the  other  hand,  diabetes  can  also  be  considered  a threshold  trait 
because  all  individuals  may  be  classified  according  to  whether  or  not  they 
are  so  severely  affected  that  clinical  treatment  is  required.  With  such  a clas- 
sification, there  are  only  two  phenotypes,  "affected"  and  "not  affected,"  even 
though  there  is  phenotypic  variation  within  each  category.  The  genetic  influ- 
ence on  the  trait  is  shown  by  the  enhanced  risk  of  diabetes  in  relatives  of 
affected  individuals.  However,  environmental  factors  such  as  diet  are  also 
important  in  determining  whether  high-risk  genotypes  actually  develop  the 
disease.  At  one  time,  many  threshold  traits  were  "explained"  by  postulating 
a simple  genetic  mechanism  (such  as  a single  recessive  allele  in  the  case  of 
diabetes)  and  invoking  "incomplete  penetrance"  to  account  for  the  poor  fit 
of  pedigree  data  to  a simple  Mendelian  hypothesis.  Now  it  is  preferred,  and 
probably  more  realistic,  to  consider  threshold  traits  as  true  polygenic  traits 
and  to  calculate  heritabilities  as  for  any  other  quantitative  trait.  In  most 
cases,  however,  it  is  simply  not  known  whether  the  genetic  influence  on  the 
trait  results  from  one  or  a few  major  genes  or  is  polygenic.  Here  again,  the 
use  of  molecular  markers  to  map  quantitative  trait  loci  will  be  able  to  dis- 
criminate between  these  two  possibilities. 

The basic idea behind the model of threshold traits is illustrated in Figure
9.24. The normal curve in panel (A) represents the (unobservable) distribution
of a hypothetical liability (or risk) toward the threshold trait, measured
on a scale such that the mean value is 0 and the variance is 1. It is assumed
that individuals whose liability is above a certain threshold (T) actually
express the trait. Thus, the shaded area in Figure 9.24A delimits the proportion
of individuals in the population who are affected ($B_p$), and the mean
liability among affected individuals is denoted $\mu_s$. Figure 9.24B gives the (again
unobservable) distribution of liability among the offspring of affected
individuals. The offspring mean is denoted $\mu'$, and the proportion of offspring
above the threshold is denoted $B_o$. The setup here is like that in the earlier
section of this chapter in which we calculated the regression coefficient of
offspring on one parent. In this model it can be shown that the regression
coefficient b is given by $b = \mu'/\mu_s$, and the appropriate estimate of the
heritability of liability is obtained from the relation

$$h^2 = 2b = \frac{2\mu'}{\mu_s} \tag{9.41}$$

The methods for calculating heritability of threshold traits are illustrated in
Problem 9.9.


Figure 9.24 (A) Distribution of liability assumed for a threshold trait in a
hypothetical population. The shaded area denotes individuals having liability
above a critical threshold (T) and consequently affected with the trait. $B_p$ is the
frequency of affected individuals in the entire population, $\mu$ is the mean liability
of individuals in the entire population, and $\mu_s$ is the mean liability of affected
individuals. (B) Distribution of liability among offspring who have one parent
affected with the trait. $\mu'$ denotes the mean liability among offspring, and $B_o$ is
the proportion of affected offspring.


PROBLEM 9.9 For pyloric stenosis, the incidence among males in
the general population is $B_p$ = 0.005, and the incidence among sons of
affected males is $B_o$ = 0.05. If liability follows a normal distribution
and 0.005 is the frequency of affected individuals, the mean liability of
affected individuals lies 2.89 standard deviations above the population
mean. From these two numbers, infer $\mu_s$ and $\mu'$ in order to calculate
the heritability of liability.


ANSWER The mean liability of the fathers is $\mu_s$ = 2.89. The threshold T
can be obtained from a table of the normal distribution as follows. The tables
generally give the value on the x axis in standard deviation units for an
observed area in the two tails of the distribution. If the fraction affected
is 0.005, and this is the area in one tail, then the area in both tails is
0.01. A two-tailed area of 0.01 corresponds to an observation
more than 2.58 standard deviations from the mean. Thus, T =
2.58 is the threshold. Using the same reasoning for the sons, 2$B_o$ = 0.10,
and this is the two-tailed area for an observation more than
1.64 standard deviations from the mean. This means that
T - $\mu'$ = 1.64. We know that T = 2.58, so $\mu'$ = T - 1.64 = 0.94. From
Equation 9.41 we get $h^2 = 2\mu'/\mu_s$ = 2(0.94)/(2.89) = 0.65. This is the estimate
of the narrow-sense heritability of liability to pyloric stenosis.
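
The table lookups in this answer can be replaced by the normal distribution functions in the Python standard library. The sketch below makes the same assumptions as the worked answer: liability is standard normal in the population, and liability among sons has unit variance, so only its mean is shifted. The mean liability of affected individuals is computed as the mean of a normal distribution truncated above the threshold, which is the density at the threshold divided by the incidence. The function name is illustrative only.

```python
from statistics import NormalDist


def liability_heritability(b_p, b_o):
    """Estimate heritability of liability from two incidences.

    b_p: incidence in the general population
    b_o: incidence among offspring of affected parents
    """
    std_normal = NormalDist()                 # liability scale: mean 0, variance 1
    threshold = std_normal.inv_cdf(1 - b_p)   # T, leaving an upper tail of b_p
    # Mean liability of affected individuals (mean of the truncated upper tail).
    mu_s = std_normal.pdf(threshold) / b_p
    # Offspring mean, placed so that a fraction b_o lies above the same threshold.
    mu_prime = threshold - std_normal.inv_cdf(1 - b_o)
    h2 = 2 * mu_prime / mu_s                  # twice the offspring-parent regression
    return threshold, mu_s, mu_prime, h2


threshold, mu_s, mu_prime, h2 = liability_heritability(0.005, 0.05)
print(f"T={threshold:.2f} mu_s={mu_s:.2f} mu'={mu_prime:.2f} h2={h2:.2f}")
```

With unrounded quantiles the estimate is slightly below the value obtained from the rounded table entries, but it agrees to within rounding.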


Twins  are  frequently  used  in  human  quantitative  genetics  for  the  study  of 
threshold  traits,  but  twin  data  are  best  expressed  in  terms  of  concordance. 
The  concordance  of  a trait  in  a population  of  twins  is  the  proportion  of  affect- 
ed twins  that  have  affected  co-twins.  For  example,  suppose  that  100  affected 
individuals  are  found  to  be  twins  and  that  in  35  cases  the  co-twin  is  also 
affected.  The  concordance  rate  is  then  35/100  = 35%.  From  the  concordance 
rates  for  monozygotic  and  dizygotic  twins  and  the  incidence  of  the  trait  in 
the population, the correlations in liability between monozygotic twins ($\rho_{MZ}$)
and dizygotic twins ($\rho_{DZ}$) can be calculated (Figure 9.25). The broad-sense
heritability is then estimated by $2(\rho_{MZ} - \rho_{DZ})$, as discussed earlier. As with
quantitative  traits  in  general,  twin  data  are  more  reliable  if  the  twins  are 
reared  apart,  but  it  is  seldom  possible  in  practice  to  obtain  a sufficient  num- 
ber of  such  twin  pairs. 


Figure 9.25 Threshold concordance rates expected in monozygotic twins,
plotted against the correlation in liability toward the trait and the population
incidence of the trait. (From Smith 1975.)


CORRELATED RESPONSE AND GENETIC CORRELATION

Genes have pleiotropic effects on phenotype; that is, every gene potentially
affects every trait in the organism, either as a primary effect or as a secondary,
indirect effect. Therefore, the alleles that are favorable for one quantitative
trait may have unfavorable effects on another quantitative trait, and as these
alleles are increased in frequency by artificial selection (thereby improving
the phenotypic value with respect to the selected quantitative trait), the very
same  alleles  may  bring  about  a deterioration  of  some  other  aspect  of  perfor- 
mance. Pleiotropy  is  one  cause  of  correlated  response — a change  in  pheno- 
typic value  of  one  trait  that  accompanies  response  to  selection  of  a different 
trait.  A second  possible  cause  of  correlated  responses  is  linkage  disequilibri- 
um (Chapter  3) — a favorable  allele  for  one  trait  that  increases  in  frequency 
under  selection  may  drag  along  with  it  an  allele  of  another,  tightly  linked 
gene  that  has  a detrimental  effect  on  an  unselected  trait. 

Correlated  responses  are  quite  common  in  artificial  selection  and  often, 
but  not  always,  result  in  a deterioration  in  reproductive  performance.  In  the 
case  of  Leghorn  chickens,  for  example,  12  generations  of  selection  for 
increased  shank  length  reduced  the  egg  hatchability  by  nearly  half  (Lerner 
1958).  In  turkeys,  to  take  another  example,  there  was  intense  selection  during 
the  period  1944-1964  for  growth  rate,  body  conformation,  and  body  size,  but 
there  was  also  a steady  decline  in  some  aspects  of  reproductive  fitness  such  as 
fertility,  egg  production,  and  egg  hatchability  (Nordskog  and  Giesbrecht 
1964). On the other hand, correlated responses can sometimes be useful. For
example, selection for larger mature body size often increases litter size in
mice  and  swine.  If  a trait  has  a low  heritability  or  is  difficult  to  measure,  it  is 
sometimes  possible  to  practice  selection  for  another,  correlated  trait,  obtaining 
progress  in  the  trait  of  interest  by  correlated  response.  Theoretically,  the  max- 
imum response  to  artificial  selection  occurs  when  the  criterion  for  selection  is 
determined  by  a selection  index  (an  averaging  across  several  traits)  which 
takes  genetic  correlations  into  account.  However,  the  theoretical  advantage  of 
index  selection  is  often  overridden  by  practical  difficulties  in  estimating  the 
components  of  the  index  and  implementing  the  selection  procedure. 

From  a theoretical  point  of  view,  the  covariance  between  two  quantitative 
traits  can  be  partitioned  in  a manner  analogous  to  the  partitioning  of  the  vari- 
ance for  one  trait  outlined  in  an  earlier  section;  the  covariance  can  thus  be 
partitioned  into  an  additive  covariance,  a dominance  covariance,  an  environ- 
mental covariance,  and  so  on.  The  most  important  theoretical  result  is  that 
the  amount  of  correlated  response  with  individual  selection  depends  only  on 
the  additive  covariance,  much  as  the  direct  response  to  individual  selection 
depends  only  on  the  additive  variance.  The  components  of  covariance 
between  traits  can  be  estimated  from  the  resemblance  between  relatives,  but 
often  it  is  preferable  to  estimate  the  correlated  response  by  direct  observation 
in  a manner  analogous  to  the  determination  of  realized  heritability  (Falconer 
and  Mackay  1996). 

The  phenotypic  correlation  is  the  correlation  one  would  obtain  by  mea- 
suring two  traits,  say  X and  Y,  and  calculating  the  correlation  coefficient 
directly.  In  symbolic  terms,  the  phenotypic  correlation  is 


$$r_p = \frac{\mathrm{Cov}_p}{\sigma_{PX}\,\sigma_{PY}} \tag{9.42}$$

where $\mathrm{Cov}_p$ is the phenotypic covariance and $\sigma_{PX}$ and $\sigma_{PY}$ are the
phenotypic standard deviations of characters X and Y. Although correlations do not
partition  in  the  same  way  as  variances,  the  phenotypic  covariance  can  be 
expressed  as  the  sum 

$$\mathrm{Cov}_p = \mathrm{Cov}_a + \mathrm{Cov}_e \tag{9.43}$$

where $\mathrm{Cov}_a$ is the additive genetic covariance, and $\mathrm{Cov}_e$ is the environmental
covariance. The genetic correlation (which is essentially the correlation of
additive  genetic  effects)  is  defined  as 


$$r_a = \frac{\mathrm{Cov}_a}{\sqrt{\sigma_{aX}^2\,\sigma_{aY}^2}} \tag{9.44}$$

where $\mathrm{Cov}_a$ is the additive genetic covariance, and $\sigma_{aX}^2$ and $\sigma_{aY}^2$ are the
additive genetic variances of the two traits. The genetic covariance is estimated in
much the same manner as the additive genetic variance. For example, the
additive genetic covariance of two traits between half sibs is $\tfrac{1}{4}\mathrm{Cov}_a$.
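
Although correlations do not partition additively, the partition of the phenotypic covariance into additive and environmental parts implies a useful relation among the correlations. The derivation below is a sketch under the assumption that everything other than additive genetic effects is lumped into the "environmental" term. The symbols $r_e$ (environmental correlation), $\sigma_{eX}$ and $\sigma_{eY}$ (environmental standard deviations), and $e = \sqrt{1 - h^2}$ are introduced here for illustration and are not defined in the text.

```latex
\begin{align*}
r_p &= \frac{\mathrm{Cov}_a + \mathrm{Cov}_e}{\sigma_{PX}\,\sigma_{PY}} \\
    &= r_a\,\frac{\sigma_{aX}\,\sigma_{aY}}{\sigma_{PX}\,\sigma_{PY}}
     + r_e\,\frac{\sigma_{eX}\,\sigma_{eY}}{\sigma_{PX}\,\sigma_{PY}} \\
    &= r_a\,h_X h_Y + r_e\,e_X e_Y ,
\qquad h = \frac{\sigma_a}{\sigma_P}, \quad
       e = \frac{\sigma_e}{\sigma_P} = \sqrt{1 - h^2}
\end{align*}
```

When both heritabilities are low, the phenotypic correlation is therefore dominated by the environmental correlation and may say little about the genetic correlation.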

Correlated  response  (CR)  occurs  when  characters  other  than  directly 
selected  traits  respond.  The  magnitude  of  correlated  responses  to  selection  is 
related  to  the  additive  genetic  correlation  between  the  selected  and  correlat- 
ed characters.  The  expected  response  is  expressed  by  the  equation 

$$CR_Y = i\,h_X h_Y\,r_a\,\sigma_{PY} \tag{9.45}$$

where X is the directly selected character with narrow-sense heritability $h_X^2$, i is
the intensity of selection, $h_Y^2$ is the heritability of the correlated character Y, $r_a$
is the genetic correlation, and $\sigma_{PY}$ is the phenotypic standard deviation of
character Y. The intensity of selection (i) is defined as the selection differential
expressed  as  a multiple  of  the  phenotypic  standard  deviation.  Equation  9.45 
says  that,  all  else  being  equal,  a doubling  in  the  genetic  correlation  will  double 
the  magnitude  of  the  correlated  response. 

It  is  instructive  to  consider  the  results  of  an  artificial  selection  experiment 
in  more  detail.  A 23-generation  artificial  selection  experiment  for  3-to-9 
week  weight  gain  in  rats  (Baker  et  al.  1975)  resulted  in  six  strains  that  were 
characterized  for  17  skull  traits  in  order  to  examine  patterns  of  correlated 
responses  (Atchley  et  al.  1982).  The  characters  included  such  measurements 
as  skull  length,  skull  width,  interorbital  width,  braincase  depth,  mandible 
width,  and  jaw  length.  The  experiment  consisted  of  two  unselected  control 
groups,  two  lines  selected  up  and  two  lines  selected  down  for  weight  gain, 
with  all  lines  originating  from  a common  stock.  The  direct  response  was 
remarkable  for  the  magnitude  of  asymmetry:  the  males  selected  for  greater 
weight  gain  were  3.46  standard  deviations  larger  than  the  controls,  while  the 
down-selected males were 1.30 standard deviations smaller than the controls.
Similarly, up-selected females were 1.46 standard deviations larger,
and down-selected females were 1.02 standard deviations smaller. Asymmetric
response is generally attributed to opposing natural selection, and no
more  specific  mechanism  can  be  offered  here.  Correlated  responses  showed 
similar  degrees  of  asymmetry.  However,  while  the  replicate  lines  showed 
statistical  consistency  in  the  magnitudes  of  direct  responses,  35  of  51  com- 
parisons between  males  in  replicate  lines  were  significantly  different  in  the 
correlated  response. 


PROBLEM 9.10 A herd of dairy cattle yields milk with a fat content
of 3.4% ± 0.65% and a protein content of 3.3% ± 0.45%. The heritabilities
of these traits are 0.60 and 0.70, respectively, and the genetic
correlation is 0.55. If selection is practiced for percent protein with a
selection intensity of i = 1.5, what increase in percent protein and
percent fat would be expected? What intensity of selection would
produce the same increase in percent fat by direct selection?


ANSWER Because $i = S/\sigma$, Equation 9.10 can be written as $R = i\sigma h^2$,
where i is the intensity of selection. For percent protein, R =
(1.5)(0.45)(0.7) = 0.47, so expected percent protein = 3.3% + 0.47% =
3.8%. For a correlated response, use Equation 9.45: CR =
(1.5)(0.6)^{1/2}(0.70)^{1/2}(0.55)(0.65) = 0.35, so expected percent fat = 3.4% +
0.35% = 3.75%. The fat increase corresponds to a direct selection of i =
0.35/[(0.60)(0.65)] = 0.90.
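
The same calculation can be written as two small functions, one for the direct response and one for the correlated response. This is a minimal sketch with illustrative function names. It treats the ± values in the problem as phenotypic standard deviations, as the worked answer does.

```python
from math import sqrt


def direct_response(i, sigma_p, h2):
    """Direct response to selection: intensity x phenotypic SD x heritability."""
    return i * sigma_p * h2


def correlated_response(i, h2_x, h2_y, r_a, sigma_py):
    """Correlated response in Y when selection of intensity i is applied to X."""
    return i * sqrt(h2_x) * sqrt(h2_y) * r_a * sigma_py


# Selection on percent protein (X); percent fat (Y) responds indirectly.
r_protein = direct_response(i=1.5, sigma_p=0.45, h2=0.70)
cr_fat = correlated_response(i=1.5, h2_x=0.70, h2_y=0.60, r_a=0.55, sigma_py=0.65)

# Intensity of direct selection on fat that would give the same gain in fat.
i_equivalent = cr_fat / (0.60 * 0.65)

print(f"protein gain={r_protein:.2f} fat gain={cr_fat:.2f} "
      f"equivalent i={i_equivalent:.2f}")
```

Because the genetic correlation enters as a simple multiplier, doubling it in the call to `correlated_response` doubles the predicted gain in fat.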


The  absence  of  correlated  response  can  sometimes  be  very  informative. 
Artificial  selection  on  spacing  between  vein  junctions  in  the  Drosophila  wing 
might  be  expected  to  result  in  correlated  changes  all  over  the  wing.  When 
Weber  (1992)  did  such  an  artificial  selection  for  a minute  (100  cells)  region  of 
the  wing,  he  obtained  a strong  direct  response  with  no  significant  correlated 
response in other wing measures. This suggests that the control of morphological
development involves many genes, and that independent selection for
minute aspects of wing morphology is possible.

As  mentioned  above,  the  two  primary  mechanisms  that  produce  genetic 
correlation  are  pleiotropy  and  linkage.  With  pleiotropy,  genes  that  affect  one 
character  also  affect  others.  These  effects  may  be  direct,  as  when  a gene  prod- 
uct has  two  distinct  functions,  or  they  may  be  indirect  in  the  sense  that  more 
than  one  physiological  step  may  connect  the  two  phenotypes.  In  both  cases, 
replacement  of  an  allele  at  the  relevant  gene  will  affect  both  phenotypes. 
Alternatively,  linkage  disequilibrium  can  result  in  genetic  correlation  even 
though  the  genes  affecting  the  two  traits  are  entirely  distinct.  If  there  is  link- 
age disequilibrium,  then  selection  of  one  gene  will  affect  allele  frequencies  of 
nearby  genes,  and  the  end  result  is  a correlation  of  changes  in  two  pheno- 
types. The  effects  of  pleiotropy  versus  linkage  disequilibrium  may  be  distin- 
guished experimentally  by  attempting  to  control  or  eliminate  linkage 
disequilibrium.  By  following  anonymous  molecular  markers  in  crosses  or 
selection  experiments,  it  is  much  easier  to  distinguish  these  two  causes  of 
genetic  correlation,  as  described  later. 

Inference  of  Selection  from  Phenotypic  Data 

The  results  of  models  for  selection  on  quantitative  traits  can  be  used  to 
infer the operation of natural selection, but in this application one must be
especially aware of the assumptions of the models; some serious statistical
problems must also be overcome. Either by sampling a population at
two  different  times,  or  by  doing  a cross-sectional  study,  changes  in  the 
phenotypic  distribution  resulting  from  differential  mortality  of  the  differ- 
ent phenotypes  can  be  detected  (Lande  and  Arnold  1983;  Arnold  and 
Wade  1984).  Changes  in  phenotypic  distributions  reflect  the  operation  of 
natural  selection,  despite  the  lack  of  any  proven  changes  in  gene  frequen- 
cy. Human  birth  weight  is  easily  shown  to  be  a character  that  is  under  the 
influence  of  natural  selection,  because  the  mortality  rate  of  very  small  and 
very  big  babies  is  higher  than  the  mortality  of  babies  close  to  the  popula- 
tion average  (Karn  and  Penrose  1951).  Extremes  of  body  weight  also  have 
the  highest  mortality  in  elderly  humans  (Harris  et  al.  1988).  Phenotypic 
selection  will  result  in  changes  in  gene  frequency  to  the  extent  that  the 
selected  phenotypes  are  heritable,  and  selection  of  traits  with  low  heri- 
tability  will  have  minimal  immediate  effects  on  the  genetic  composition  of 
the  population. 

The  correspondence  between  phenotypes  and  relative  fitness  can  be  esti- 
mated from  the  mortality  suffered  between  two  times  of  censusing.  A classi- 
cal example  that  illustrates  the  method  is  a reanalysis  by  Lande  and  Arnold 
(1983)  of  the  data  from  Bumpus  (1899),  involving  136  sparrows  that  had  been 
incapacitated  by  a severe  winter  storm.  Eight  measurements  were  taken  of 
every  bird,  and  about  half  of  the  birds  subsequently  revived.  The  phenotypic 
measures  and  the  fraction  of  birds  that  survived  provide  a way  to  examine 
the  relation  between  the  phenotypic  measures  and  the  fitness.  First,  the  data 
required  transformation  to  the  form  of  a multivariate  Gaussian  distribution. 
After  various  confounding  factors  (such  as  age)  were  shown  to  have  little 
effect  on  mortality,  standard  methods  were  used  to  estimate  the  partial 
regression  coefficients  reflecting  the  effect  of  each  phenotypic  trait  in  the  frac- 
tion of  birds  that  survived.  One  can  imagine  a mapping  from  a set  of  pheno- 
typic measures  to  a multivariate  surface  of  fitnesses.  The  partial  regression 
coefficients  were  standardized  to  units  of  phenotypic  standard  deviation. 
The weight character was found to have a significant regression in males
(b = -0.27 ± 0.09) and in females (b = -0.52 ± 0.25). The regression
coefficients are negative because, contrary to expectation, the smaller birds
had a lower mortality. Similar methods were applied more recently by Gibbs
and  Grant  (1987)  to  a species  of  Darwin's  finches  on  the  Galapagos  Island  of 
Daphne  Major.  Generally,  larger  body  size  of  birds  is  favored  during  dry 
years,  evidently  because  the  most  abundant  food  consists  of  large,  hard 
seeds. After a prolonged disruption in Pacific Ocean currents known as El
Niño, there occurred a year with 10 times the normal rainfall, and a reversal
in  the  selection  differential  favored  small  body  size.  The  reversal  was  consis- 
tent with  the  abundant  appearance  of  small  seeds  produced  by  adventitious 
plant growth.

The Bumpus method for inferring phenotypic selection has faced criticism
on a number of grounds. Mitchell-Olds and Shaw (1987) pointed out
that, as in other applications of multiple regression, the traits can be
intercorrelated (multicollinear), the estimators may be biased when traits are
measured with random experimental error, and the estimators are not consistent
(do  not  converge  to  true  values  with  increasing  sample  size)  when  the  errors 
are  not  identically  distributed  for  all  traits.  Lande  and  Arnold  (1983)  dis- 
cussed the  additional  problem  of  a strongly  selected  character  that  is  not 
among  those  studied.  In  the  end,  Mitchell-Olds  and  Shaw  (1987)  recommend 
experimental  manipulation  to  accompany  purely  observational  regression 
analysis  of  selection. 
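
The distinction between total and partial regression, and the reason correlated traits complicate it, can be seen in a toy calculation. The sketch below uses small hypothetical data (not the Bumpus measurements) in which fitness depends on the first trait only, while the second trait is merely correlated with the first. The selection differential for each trait is taken as its covariance with relative fitness, and the partial regression coefficients are obtained by solving the two-trait normal equations. All names are illustrative, and fitness is treated here as a continuous quantity rather than as survival scored 0 or 1.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def selection_analysis(z1, z2, fitness):
    """Return selection differentials and partial regression coefficients."""
    w_bar = mean(fitness)
    w = [f / w_bar for f in fitness]             # relative fitness, mean 1
    s = (covariance(w, z1), covariance(w, z2))   # total selection on each trait
    p11, p22 = covariance(z1, z1), covariance(z2, z2)
    p12 = covariance(z1, z2)
    det = p11 * p22 - p12 ** 2                   # near 0 if traits are collinear
    beta1 = (p22 * s[0] - p12 * s[1]) / det      # direct selection on trait 1
    beta2 = (p11 * s[1] - p12 * s[0]) / det      # direct selection on trait 2
    return s, (beta1, beta2)


# Hypothetical data: fitness is a function of trait 1 alone.
z1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
z2 = [-1.0, -2.0, 0.0, 2.0, 1.0]                 # correlated with z1
fitness = [1 + 0.2 * x for x in z1]

s, beta = selection_analysis(z1, z2, fitness)
print("selection differentials:", s)
print("partial regression coefficients:", beta)
```

Both traits show a positive selection differential, but only the first has a nonzero partial regression coefficient. As the two traits become more strongly correlated, `det` approaches zero and the coefficients become unstable, which is the multicollinearity problem in miniature.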

The  effect  of  a quantitative  character  on  reproductive  fitness  can  be 
obtained  very  directly  in  some  circumstances.  When  the  component  of  selec- 
tion is  mating  success,  a simple  comparison  of  phenotypes  of  mating  and 
nonmating individuals constitutes such a test. For example, Taylor and Kekić
(1988)  measured  the  wing  lengths  of  55  male  Drosophila  melanogaster  that 
were  captured  while  mating  and  55  nonmating  males  that  were  caught  by 
random-sweep  netting.  The  mean  wing  lengths  were  1.418  ± 0.012  mm  and 
1.372  ± 0.013  mm  respectively,  indicating  that  the  mating  males  had  signifi- 
cantly larger  wings.  Careful  laboratory  studies  have  verified  the  mating 
advantage  of  larger-winged  males  (Partridge  et  al.  1995;  Wilkinson  1987). 
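
A rough check of the significance claim is possible from the summary statistics alone. The sketch below assumes that the ± values are standard errors of the two means and that the samples are independent and large enough for a normal approximation; neither assumption is stated in the text, so the result is indicative only.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, se1, mean2, se2):
    """Normal-approximation test for a difference between two means."""
    z = (mean1 - mean2) / sqrt(se1 ** 2 + se2 ** 2)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Mean wing length (mm) of mating and nonmating males, with assumed SEs.
z, p_value = two_sample_z(1.418, 0.012, 1.372, 0.013)
print(f"z={z:.2f} two-sided p={p_value:.4f}")
```

Under these assumptions the difference of 0.046 mm is about 2.6 standard errors, consistent with the statement that mating males had significantly larger wings.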

Turelli  (1988)  emphasized  that  long-term  predictions  of  responses  to  nat- 
ural selection  depend  critically  on  the  stability  of  the  genetic  variance-covari- 
ance matrix,  defined  as  a square  matrix  whose  diagonal  elements  are 
additive  genetic  variances  of  the  traits,  and  the  off-diagonal  elements  are 
additive genetic covariances. Without direct information on the stability of
the genetic variance-covariance matrix, G, it is difficult to assess the confidence
in estimates of selection. Wilkinson et al. (1990) calculated empirical
estimates of G in Drosophila populations
selected for 23 generations on thorax length. Statistical tests of the
constancy  of  G found  that  selection  for  large  thorax  did  not  change  G from 
the  control  population,  but  selection  for  small  thorax  did  result  in  significant 
changes  in  G (Shaw  et  al.  1995).  The  change  may  be  in  part  due  to  gene  fre- 
quency changes  caused  by  selection,  changes  in  gene  frequency  due  to  ran- 
dom drift,  and  changes  in  linkage  disequilibrium  (the  Bulmer  effect).  Given 
that  significant  changes  in  G can  be  introduced  by  very  short-term  laborato- 
ry experiments,  long-term  evolutionary  projections  based  on  current  patterns 
of  genetic  variance  and  covariance  are  probably  questionable. 

EVOLUTION  OF  QUANTITATIVE  TRAITS 

Evolutionary  quantitative  genetics  is  an  application  of  population  genetics 
in  which  the  distributions  of  phenotypic  variation  and  covariation  and  the 
changes in those distributions are modeled by considering the changes in the
underlying genes. The genetic models are intrinsically multilocus; this
would  make  them  very  cumbersome,  except  that  generally  simplifying 
assumptions  are  made.  This  is  a challenging  field  both  experimentally  and 
theoretically,  and  it  is  an  area  of  very  active  research  at  present.  In  this  sec- 
tion some  of  the  basic  principles  and  conclusions  of  evolutionary  quantita- 
tive genetics  will  be  considered. 

Random  Genetic  Drift  and  Phenotypic  Evolution 

Genetic variation in a finite population always undergoes the process of
random genetic drift. If the genetic variation affects a quantitative
character, then the variance in the quantitative character will in turn be
affected by random drift. In a population of effective size $N_e$, the
expected genetic variance $\sigma_g^2$ changes over successive discrete
generations according to

$E(\sigma_g^2)' = [1 - 1/(2N_e)]\,E(\sigma_g^2) + \sigma_m^2$    9.46

where $\sigma_m^2$ is the increment in genetic variance added each generation
by mutation (Clayton and Robertson 1955, 1957; Lande 1979, 1980; Turelli et
al. 1988). When the influx of new mutations (inflating the variance) is
balanced by the loss of variance due to random genetic drift, the population
arrives at the mutation-drift equilibrium. The expected genetic variance at
the mutation-drift equilibrium is

$E(\sigma_g^2) = 2N_e\sigma_m^2$    9.47
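
The approach to this equilibrium can be checked by iterating Equation 9.46
directly. The following is a minimal sketch, not part of the original
analysis: the function name and the values $N_e = 10$ and
$\sigma_m^2 = 0.001$ are illustrative assumptions, and the population is
assumed to start with no genetic variance.

```python
def expected_variance_trajectory(n_e, sigma2_m, generations, start=0.0):
    """Iterate Equation 9.46 for the expected genetic variance."""
    trajectory = [start]
    for _ in range(generations):
        # Drift removes a fraction 1/(2 N_e); mutation adds sigma2_m.
        trajectory.append((1.0 - 1.0 / (2.0 * n_e)) * trajectory[-1] + sigma2_m)
    return trajectory


n_e = 10            # illustrative effective population size (assumption)
sigma2_m = 0.001    # illustrative mutational variance (assumption)

traj = expected_variance_trajectory(n_e, sigma2_m, generations=40 * n_e)
equilibrium = 2 * n_e * sigma2_m            # Equation 9.47
fraction_at_4ne = traj[4 * n_e] / equilibrium

print("equilibrium from Equation 9.47:", equilibrium)
print("variance after 40 N_e generations:", traj[-1])
print("fraction of equilibrium reached after 4 N_e generations:", fraction_at_4ne)
```

With these values the iteration converges on $2N_e\sigma_m^2$, and after
$4N_e$ generations the expected variance has covered roughly 87% of the
distance to equilibrium, because the remaining gap shrinks by the factor
$1 - 1/(2N_e)$ each generation.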

As in the case of random genetic drift for one gene, the population is
expected to arrive at the mutation-drift equilibrium after about $4N_e$
generations. Moreover, given two lineages that diverged t generations ago,
the expected variance between their mean phenotypes is $2t\sigma_m^2$. Just
as the neutral mutation theory predicts that the rate of gene substitution
(and hence divergence) is independent of population size and depends only on
the neutral mutation rate, the neutral rate of phenotypic divergence depends
only on the rate of mutations affecting the phenotype. This result was
derived rigorously by Lynch and Hill (1986), who also showed that the rate of
divergence is dependent only on the rate of purely additive mutations, and is
independent of dominance and epistatic effects. Figure 9.26 shows the
increase in within-population variance (from an initial population with no
variation), until the population reaches a steady-state balance of mutation
and drift. Simulations also verify the nearly linear increase in variance
between populations with time. Extending neutral divergence to the case of
multiple characters, Lynch and Hill (1986) showed that the neutral divergence
of the variance-covariance matrix depends only on the mutational
variance-covariance matrix. In a model with n genes, k alleles at each,
population size N, and mutation rate $\mu$ per locus, Cockerham and Tachida
(1987) found that the initial rate of increase of variance between
populations depended on N, but the asymptotic rate of divergence, and the
steady-state variance between populations, depended only on the mutation
rates.

Figure 9.26 Simulations of a model with mutation of genes underlying a
quantitative character and random genetic drift in a finite, subdivided
population. The jagged lines represent the simulations, while the smooth
curves are based on an analytical model. One hundred populations were run in
each sample and each case. A mutation rate of 0.001 per locus per generation
was used in all runs. (Top) Effective population size = 2, number of
loci = 50. (Middle) Population size = 10 with 10 loci. (Bottom) Population
size = 10 with 50 loci. The steady-state within-population variance increases
with population size, but the rate of increase of variance between
populations is twice the mutational variance.

Turelli et al. (1988) used the mutation-drift equilibrium as a null
hypothesis for devising a statistical test. Evolution has been too rapid to
be explained by the neutral model (with 95% confidence) if

$\sigma_m^2/\sigma_p^2 < (\Delta z/\sigma_p)^2 / [2t(1.96)^2]$    9.48

where $\sigma_m^2$ is the mutational variance, $\sigma_p^2$ is the phenotypic
variance, and $\Delta z$ is the change in phenotypic mean in time period t.
In other words, the mutation-drift equilibrium places a lower bound on the
ratio of mutational variance to the total phenotypic variance. If the
observed $\sigma_m^2/\sigma_p^2$ is smaller than the critical value given in
Equation 9.48, then mutation does not introduce enough variance to account
for the observed divergence. The test in Equation 9.48 is only useful on a
macroevolutionary time scale, when there is some hope that the diverged
populations actually have reached a mutation-drift equilibrium. For
situations like Bumpus's sparrows or the influence of the rains caused by El
Niño on Darwin's finches, this test is not appropriate, and tests that
compare the magnitude of change to that expected by random drift in a
population with the given effective size are used (Lande 1976, 1977).
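
The comparison in Equation 9.48 is simple to carry out numerically. The
sketch below is only an illustration: the function names and the example
values (a divergence of two phenotypic standard deviations over 1000
generations) are hypothetical and are not taken from Turelli et al. (1988).

```python
import math


def critical_ratio(delta_z, sigma_p, t, z_crit=1.96):
    """Right-hand side of Equation 9.48."""
    return (delta_z / sigma_p) ** 2 / (2.0 * t * z_crit ** 2)


def too_fast_for_neutrality(sigma2_m, sigma2_p, delta_z, t):
    """True if the observed ratio falls below the critical value."""
    observed_ratio = sigma2_m / sigma2_p
    return observed_ratio < critical_ratio(delta_z, math.sqrt(sigma2_p), t)


# Hypothetical example: means differ by 2 phenotypic standard deviations
# after t = 1000 generations, with the phenotypic variance scaled to 1.
threshold = critical_ratio(delta_z=2.0, sigma_p=1.0, t=1000)
print("critical ratio:", threshold)
print("ratio 1e-3 rejects neutrality:", too_fast_for_neutrality(1e-3, 1.0, 2.0, 1000))
print("ratio 1e-4 rejects neutrality:", too_fast_for_neutrality(1e-4, 1.0, 2.0, 1000))
```

In this made-up case the critical ratio is about 0.0005, so a mutational
variance of 0.001 of the phenotypic variance is enough to explain the
divergence by drift, whereas a ratio of 0.0001 is not.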

The amount of variance in quantitative traits introduced by mutation each
generation is a quantity of considerable interest. Mutational variance can be
estimated by two methods: quantification of the increase in variance among
initially identical lines as mutations accumulate over generations, or
response to selection in a population that is initially in mutation-drift
equilibrium or is devoid of variation (Lynch 1994). It might be supposed that
an artificial selection experiment in a population lacking genetic variation
would be doomed to failure, but populations accumulate sufficient mutational
variation to produce a response in a few generations (Fry et al. 1995). In
both cases, the action of natural selection will bias the estimates of
mutational variance, so care is taken to minimize natural selection. In
mutation accumulation experiments in Drosophila, this has traditionally been
done with balancer chromosomes that prevent recombination and minimize
selection by maintaining chromosomes in a heterozygous state. Lynch (1988)
and Houle et al. (1996) did comprehensive reviews of experimental estimates
of $\sigma_m^2/\sigma_e^2$ and found that this ratio generally falls in the
range of $10^{-2}$ to $10^{-4}$. On an absolute scale, the number of
deleterious mutations per gamete per generation is very close to 1.0.
Interestingly, the estimate of $\sigma_g^2/\sigma_m^2$, the ratio of genetic
variance to the mutational variance, is expected to equal the median
persistence time of mutant alleles. The estimated persistence times for
traits associated with life history averaged around 50 generations, while the
persistence time of morphological trait mutations was around 100 generations.
In order to see whether this figure is reasonable, we need to consider what
the models of mutation-selection balance would predict.

One experiment to quantify mutational variance that gave an estimate orders
of magnitude different from those mentioned above was a study of E. coli
mutation-accumulation lines (Kibota and Lynch 1996). The experiment was done
by expanding 50 independent lineages of cells from a single cell, and
following each of the 50 lineages for 300 growth cycles (each of about 25
generations). The per-cell rate of deleterious mutation was found to be
0.0002, in contrast to the per-individual deleterious mutation rate in
Drosophila of about 1.0 per generation. The discrepancy is probably in part
caused by the smaller genome size of E. coli (about 1/35 that of Drosophila),
and the fact that Drosophila undergo about 25 cell divisions per organismal
generation. In fact, when the E. coli mutation rate is scaled up by these two
factors, the effective rate of 0.21 mutations per generation is remarkably
close to that of Drosophila.

Artificial selection experiments can provide information about rates of
accumulation of variance through mutation. The continued response seen in the
Illinois corn oil experiment in Figure 9.4 is almost certainly due in large
part to selection operating on variation introduced subsequent to the start
of the study. Enfield (1980) obtained a similarly large response to selection
on Tribolium pupal weight and argued that mutational variance was at least
part of the cause. In comparing selection response in large and small
populations, Weber (1990a,b) and Weber and Diggins (1990b) were trying to
quantify the importance of drift in artificial selection experiments. The
striking response they obtained for wing tip height and alcohol vapor
tolerance, and the increased response in larger populations, suggested that
larger populations are better able to produce the full range of genetic
variation for the trait, in part perhaps by mutation, but also by
recombination. The importance of having the right combinations of genes for
selection response to occur was very clear in a selection experiment for
flight speed (Weber 1996). One hundred generations produced a nearly linear
increase in flight performance. At that point, selected and control flies
were crossed, and the F1 flies did little better than the controls.
Subsequent selection on these hybrids and their descendants recovered in just
six generations the performance that had taken 75 generations to achieve. It
is likely that the secondary selection went so much faster because the F1
flies had preassembled clusters of alleles conferring stronger flight, so
response did not require waiting for rare recombinants to construct them.

Transposable elements are one source of mutations that have received
particular attention by evolutionary quantitative geneticists. Because
transposable elements may jump with appreciable rates in some species, it was
thought that they may contribute a substantial amount of mutational
variation. By performing artificial selection in the presence of transposing
P-elements and controls in the absence of P-elements, Mackay (1985) obtained
the first evidence in Drosophila that P-elements may contribute genetic
variance that can result in increased selection response (see Problem 9.2).
The joint effects of P-element insertions on bristle traits and on viability
were quantified by Lyman et al. (1996) in a sample of 1094 single-P-element
insertion lines. The magnitude of effects exceeded those seen in mutation
accumulation experiments for spontaneous mutations. They found that most of
the variance in bristle number was caused by a few insertions of relatively
large effect, and that the elements with the largest phenotypic effect were
also the most deleterious. In another study of random P-element insertion
lines scored for 14 metabolic characters, single insertions were found to
have significant effects on several metabolic traits at once, indicating
substantial levels of pleiotropy (Clark et al. 1995b). Insertional mutations
are not limited to Drosophila, and Keightley et al. (1993) have demonstrated
that retroviral insertions clearly generate additional genetic variation in
body weight in mice. These experiments demonstrate that transposable and
retroviral elements may be a significant source of quantitative genetic
variation in natural populations.

Mutation-Selection  Balance 

Stabilizing selection is of great interest for the role it may play in the
persistence of heritable genetic variation in multifactorial traits.
Intuitively, it seems reasonable to suppose that observed levels of additive
genetic variation may result from a balance between stabilizing selection,
which tends to reduce genetic variation, and new mutations, which tend to
increase it. Deceptively simple when stated verbally, the models become very
complex when formulated in mathematical terms, and include such complications
as the number of genes, the type of action of alleles and their interactions,
linkage between genes, the type and intensity of selection, and the influence
of selection on other traits that are related through pleiotropy. Pleiotropy
is reflected in the widespread tendency of genes to affect several traits
simultaneously, which usually results from the fact that complex phenotypic
traits are determined by the interactions of the products of many genes
during development. The number of genes affecting a trait is relevant to
mutation-selection balance because, for a given genetic variance, the
selection intensity per locus decreases as the number of loci increases. If
the total mutation rate is fixed, the per-locus mutation rate must decrease
as the number of loci increases. The challenge is to develop theory that
relates the equilibrium additive genetic variance, in a population having
stabilizing selection balanced against mutation, to the underlying
parameters. These include the effective population size, the mutation rate
per locus, the number of loci affecting the trait, the distribution of
mutational effects, and the distribution of fitness effects of new mutations.

As you might have surmised, progress in understanding mutation-selection
balance in such a complex system has come from making many simplifying
assumptions. Following Kimura (1965), Lande (1975), and Turelli (1984), let
$p_t(x)$ be the distribution of allelic effects before selection in
generation t. In each generation, selection occurs, then mutation, and then
reproduction. Assume that phenotypes and underlying genotypic effects have a
Gaussian distribution; since selection would be likely to change the
distribution, this approximation remains acceptable only under weak
selection. Although the model as developed is haploid, it is intended as an
approximation for a diploid model assuming additivity of gene effects.
Letting $p_t'(x)$ be the density of allelic effects after selection, $\mu$ be
the mutation rate (the fraction of gametes that are mutated), and $g(x)$ be
the density of mutational effects on the phenotype, the recursion for $p(x)$
is

$p_{t+1}(x) = (1 - \mu)\,p_t'(x) + \mu \int p_t'(y)\,g(x - y)\,dy$    9.49

This model, called the Kimura-Lande-Fleming model by Turelli (1984),
partitions the phenotypic distribution into two components. A fraction
$(1 - \mu)$ of the distribution derives from nonmutant genes, and their
distribution is $p_t'(x)$ after selection. Fitness is a Gaussian function of
phenotype, with the optimum phenotype coinciding with the mean. Genes that
mutate have already undergone selection, so their effects y have
distribution $p_t'(y)$, and mutation changes the effect from y to x with
density $g(x - y)$. Since an allele with effect x can arise by mutation from
many different values of y, the integral is taken over all values of y.
Kimura (1965) showed that, with weak selection, the process in Equation 9.49
comes to a steady state with a Gaussian distribution of allelic effects. This
model leads to a paradox. On the one hand, Lande (1975) found an approximate
Gaussian equilibrium that held in the case of small mutational effects but
required unacceptably high mutation rates. On the other hand, Turelli (1984)
introduced the "house-of-cards" view of mutation to this model; it allows
individual mutations to have large effects so that the postmutation
distribution of allelic effects is $g(y)$, which is independent of x, the
premutation effect of an allele. Both models could give reasonable
equilibrium distributions of allelic effects, but under vastly different
assumptions about numbers of loci affecting traits, mutation rates, and
magnitudes of mutational effects. Where in this complex space of parameters
does reality lie? It became apparent that only empirical observations would
settle the matter.
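
To get a feel for how the recursion in Equation 9.49 behaves, it can be
iterated numerically on a grid of allelic effects. The sketch below is an
illustration under stated assumptions rather than a reproduction of any
published analysis: the Gaussian fitness function with width `v_s`, the
Gaussian mutation kernel with variance `alpha2`, the grid, the starting
distribution, and the deliberately large mutation rate `mu = 0.05` (chosen
only so that the steady state is reached quickly) are all hypothetical.

```python
import math


def gaussian(z, variance):
    """Normal density with mean zero and the given variance."""
    return math.exp(-z * z / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def iterate_klf(mu, alpha2, v_s, generations, half_width=2.0, dx=0.05):
    """Iterate Equation 9.49 (selection, then mutation) on a discrete grid."""
    n = int(round(2 * half_width / dx)) + 1
    xs = [(i - (n - 1) // 2) * dx for i in range(n)]
    fitness = [math.exp(-x * x / (2.0 * v_s)) for x in xs]
    # kernel[k] is g evaluated at the offset (k - (n - 1)) * dx
    kernel = [gaussian((k - (n - 1)) * dx, alpha2) for k in range(2 * n - 1)]

    p = [gaussian(x, 0.01) for x in xs]      # narrow starting distribution
    total = sum(p) * dx
    p = [v / total for v in p]

    variances = []
    for _ in range(generations):
        # Selection: weight by fitness and renormalize to obtain p_t'(x).
        selected = [pi * wi for pi, wi in zip(p, fitness)]
        total = sum(selected) * dx
        selected = [v / total for v in selected]
        # Mutation: nonmutants plus the convolution of p_t' with g.
        mutated = []
        for i in range(n):
            conv = sum(selected[j] * kernel[i - j + n - 1] for j in range(n)) * dx
            mutated.append((1.0 - mu) * selected[i] + mu * conv)
        # Renormalize to absorb the small mass lost at the grid edges.
        total = sum(mutated) * dx
        p = [v / total for v in mutated]
        mean = sum(x * v for x, v in zip(xs, p)) * dx
        variances.append(sum((x - mean) ** 2 * v for x, v in zip(xs, p)) * dx)
    return xs, p, variances


xs, p, variances = iterate_klf(mu=0.05, alpha2=0.05, v_s=1.0, generations=400)
mean = sum(x * v for x, v in zip(xs, p)) * 0.05
print("variance of allelic effects after 400 generations:", variances[-1])
print("change in the final generation:", variances[-1] - variances[-2])
```

The variance of allelic effects settles to a steady value at which the
variance removed by stabilizing selection each generation is replaced by
mutation. Changing `mu` and `alpha2` while holding their product roughly
constant is one way to explore the contrast between the many-small-mutations
and house-of-cards regimes described above.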


PROBLEM 9.11 Why does the problem of mutation-selection balance for
quantitative characters depend on the number of loci that determine a trait?


ANSWER The problem of mutation-selection balance for quantitative
characters depends on the number of loci that determine a trait in two ways.
First, as the number of loci increases, the per-locus selection coefficients
must become smaller, and so the per-locus selective effects may fall below
$1/(2N)$ with sufficiently many loci. Second, as the number of loci
increases with a fixed mutation rate per locus, the effective number of
mutations influencing the trait increases.


As reviewed by Barton and Turelli (1989), there is also a paradox in the
results of models and experiments to determine mutational variance. Per-trait
mutation rates from mutation-accumulation experiments are on the order of
$10^{-2}$ or $10^{-1}$, and per-locus mutation rates from classical genetic
studies are on the order of $10^{-6}$. Logic implies that these two
observations are compatible if each trait is determined by 100 to 1000 genes.
Yet estimates of classical effective numbers of genes affecting traits
typically give a range of 5 to 20 genes. As is often the case in science,
such paradoxes may generally best be resolved by breaking out of the standard
way of thinking about the problem and taking an altogether different course.
The course that is already changing our views of the genetic basis of
quantitative traits, namely direct mapping of segments of the genome that
affect the traits, is covered in the next section.


QUANTITATIVE  TRAIT  LOCI 

Despite the landmark paper of Fisher (1918) and the impressive work since,
there remains a gap between the theory of quantitative traits and the
identification of classical Mendelian genes. The good fit of molecular
variation to many aspects of the neutral theory contrasts strongly with the
appeal of adaptation in morphological evolution. Although the discrepancy is
no doubt exaggerated by intuitive notions that something as obvious as horns
must be useful, and something as subtle as silent nucleotide substitutions
must have negligible effects, it remains an important problem to bring these
two levels of population genetics together. One approach is to identify the
explicit relations among Mendelian genes and quantitative traits. Let us see
how this is being done.

Mapping  Genes  that  Influence  Quantitative  Characters 

Largely because of the density of markers identified by a variety of
molecular methods, there has been a renewed interest in mapping genes (or
blocks of genes) that affect quantitative traits. Observations of linkage
between a marker gene and a gene that influences a quantitative trait date
back to Sax (1923). Interest in actually mapping "quantitative trait loci"
(or QTLs) also has a long history, beginning with Thoday's (1961) studies of
bristle traits in Drosophila. Reviews of the early work can be found in
Thoday (1979). The essence of the experiments on Drosophila was to construct
lines that differed with respect to a quantitative character (such as "up"
and "down" selected lines for bristle number), and to construct a random set
of recombinants among them, using multiply marked chromosomes (each bearing
several recessive mutations whose phenotype can be readily scored), to infer
which chromosomal regions the recombinants carried. The end result of such
experiments is not a genetic map in the classical sense, identifying
locations of a series of specific genes that determine the quantitative
character. The result is rather a statistical description of the character
that indicates what fraction of the genetic variance in the parental lines is
contributed by each region identified by the flanking markers. When this
method is taken to a fine enough genetic scale, it is possible to identify
individual genes that affect a trait, and the application of dense sets of
molecular markers motivated the original drive to look again at a fine
genetic scale (Paterson et al. 1988; Lander and Botstein 1989; Tanksley
1993).

The simplest means for mapping QTLs is to consider a cross between two
parental lines that differ widely in phenotype. If the parental lines are
fixed for alternate alleles at many loci, the F1 hybrids will be heterozygous
at those loci. Intercrossing the F1 then produces a highly variable F2
population, both in terms of the underlying genotypes and in the resulting
distributions of phenotypes. Genes that are closely linked may be represented
in only one linkage phase in the F2, so one limitation of this method is that
it does not allow very fine resolution of genes, but it is a good place to
start.

Suppose we have constructed an F2 population as described above. This F2
population is scored for phenotypes and a series of molecular markers that
differed between the original parental lines. First consider the case in
which a molecular marker indicates the allelic state of a gene that directly
affects a quantitative trait. In order to quantify the effects of the gene on
the trait, regression methods can be applied as illustrated in Figure 9.27.
The assumption in doing regression is that this locus has an effect that will
be apparent when averaging across all genotypes at the other loci. Such
marginal effects are expected if the genes act additively across loci, but if
they interact, the marginal effects may either underestimate or overestimate
the importance of a particular gene in determining the trait. If the alleles
are labeled A1 and A2, and genotypes are indexed by the number of copies of
A2 alleles they possess, then regression of the phenotype on this index will
produce a regression coefficient that estimates the value of the additive
effect a (Figure 9.27). Similarly, after indexing the homozygotes by 0 and
heterozygotes by 1, the regression of the phenotype on this index produces a
regression coefficient that estimates the dominance parameter, d. One measure
of the significance of each marker on the phenotype might be simply to
determine the statistical significance of these regressions in the standard
way. As we will see, this is not the best approach.
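
The two regressions just described can be tried on artificial data. The
following is a minimal sketch under stated assumptions: the F2 sample is
idealized to exactly 1:2:1 genotype proportions, the genotypic values are
taken to be m - a, m + d, and m + a, and the "true" values a = 1.0, d = 0.5,
and m = 10.0 and the unit environmental variance are invented purely for
illustration.

```python
import random


def slope(x, y):
    """Least-squares slope of the regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)
a_true, d_true, mean_true = 1.0, 0.5, 10.0   # hypothetical values

# Number of A2 alleles carried by each F2 individual, in exact 1:2:1 ratio.
copies = [0] * 1000 + [1] * 2000 + [2] * 1000
heterozygote = [1 if c == 1 else 0 for c in copies]

phenotypes = [
    mean_true + a_true * (c - 1) + d_true * h + rng.gauss(0.0, 1.0)
    for c, h in zip(copies, heterozygote)
]

a_hat = slope(copies, phenotypes)        # regression on number of A2 alleles
d_hat = slope(heterozygote, phenotypes)  # regression on the 0/1/0 index

print("estimated additive effect a:", a_hat)
print("estimated dominance effect d:", d_hat)
```

Because the additive and dominance indices are uncorrelated in an F2 with
1:2:1 proportions, the two simple regressions recover a and d separately. In
real samples the genotype proportions fluctuate, and fitting both indices in
one multiple regression is the safer design.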

Figure 9.27 Illustration of the use of regression to estimate parameters of
QTL expression. Ignoring recombination for now, the genotypes AA, AA', and
A'A' are indexed -1, 0, and 1, and regression of phenotypes on these indices
yields an estimate of the additive effect of the locus. Indices 0, 1, and 0
for genotypes AA, AA', and A'A' yield regression estimates of dominance
effects.

A shortcoming of the above method is that it assumes that the markers
themselves are affecting the quantitative trait. One improvement is to
suppose that there is a quantitative trait locus (or QTL) that is not
directly observed, but that lies between a pair of markers that are scored
(Figure 9.28). The idea is to still use a regression model to relate
genotypes at the QTL to observed phenotypes, but now the genotypes are not
directly observed. Instead, we infer the probability that each individual has
a genotype at the QTL from the genotypes at the flanking markers. For
example, if the marker genotype is A1A1 B1B1, the QTL genotype is probably
QQ, but a double recombination in either gamete will result in genotype Qq,
and a double recombination in both gametes will result in qq. Per gamete, the
probabilities of no recombination, a single recombination, and a double
recombination between the flanking markers are $1 - r_1 - r_2 + r_1r_2$,
$r_1 + r_2 - 2r_1r_2$, and $r_1r_2$, respectively. One can work out the
probabilities of all genotypes by using the gamete frequencies produced by
the F1 individuals as listed in Figure 9.28 to build an 8 x 8 Punnett square.
For each position of the QTL on the map, one fits the model

$Y_j = \mu + \sum_i (g_{ij}\,x_{ai}\,a_i + g_{ij}\,x_{di}\,d_i) + \varepsilon$

$P(Q \mid A_1B_1) = (1 - r_1)(1 - r_2)/x$
$P(q \mid A_1B_1) = r_1r_2/x$
$P(Q \mid A_1B_2) = (1 - r_1)r_2/y$
$P(q \mid A_1B_2) = r_1(1 - r_2)/y$
$P(Q \mid A_2B_1) = r_1(1 - r_2)/y$
$P(q \mid A_2B_1) = (1 - r_1)r_2/y$
$P(Q \mid A_2B_2) = r_1r_2/x$
$P(q \mid A_2B_2) = (1 - r_1)(1 - r_2)/x$
where $x = 1 - r_1 - r_2 + 2r_1r_2$
and $y = r_1 + r_2 - 2r_1r_2$

Figure 9.28 Composite interval mapping for quantitative trait loci is done by
first expressing the probability that each marker locus genotype has a given
QTL genotype. The flanking markers are A and B with the QTL in the middle,
and recombination frequencies as specified. See text for a description of how
the position of the QTL is determined.


This is a multiple regression with phenotype Y expressed as a linear function
of grand mean $\mu$, additive terms $a_i$, dominance terms $d_i$, and error
$\varepsilon$. The $x_{ai}$ are indicator variables with values -1, 0, and 1
for genotypes QQ, Qq, and qq, and $x_{di}$ represents the indicator variables
for dominance terms, with $x_{di}$ = 0, 1, 0 for QQ, Qq, and qq. Because the
QTL genotype is not actually observed, the probability that marker genotype j
had QTL i ($g_{ij}$) is determined from the markers as described above. In
practice, the goodness of fit of this model is assessed by a likelihood
ratio, and the position of the QTL in the genome that gives the maximum
likelihood is the most likely location for the QTL. Most often there is more
than one peak to the likelihood curve, and this may very well reflect the
presence of more than one QTL in the genome.
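
The conditional probabilities listed in Figure 9.28 are easy to tabulate for
any pair of recombination fractions. The sketch below assumes, as the figure
does, an F1 of phase A1 Q B1 / A2 q B2 and no crossover interference; the
function name and the example values $r_1 = r_2 = 0.1$ are illustrative
assumptions.

```python
def qtl_given_marker_haplotype(r1, r2):
    """Return {marker haplotype: (P(Q), P(q))} for an A1 Q B1 / A2 q B2 F1.

    r1 is the recombination fraction between A and the QTL, and r2 is the
    recombination fraction between the QTL and B.
    """
    x = 1.0 - r1 - r2 + 2.0 * r1 * r2   # markers not recombinant
    y = r1 + r2 - 2.0 * r1 * r2         # markers recombinant
    return {
        "A1B1": ((1 - r1) * (1 - r2) / x, r1 * r2 / x),
        "A1B2": ((1 - r1) * r2 / y, r1 * (1 - r2) / y),
        "A2B1": (r1 * (1 - r2) / y, (1 - r1) * r2 / y),
        "A2B2": (r1 * r2 / x, (1 - r1) * (1 - r2) / x),
    }


table = qtl_given_marker_haplotype(0.1, 0.1)
for haplotype, (p_big_q, p_small_q) in table.items():
    print(haplotype, round(p_big_q, 4), round(p_small_q, 4))

# An individual with marker genotype A1A1 B1B1 received two A1B1 gametes,
# so its QTL genotype probabilities are products of the per-gamete terms.
p_q_big, p_q_small = table["A1B1"]
genotype_probs = {
    "QQ": p_q_big ** 2,
    "Qq": 2.0 * p_q_big * p_q_small,
    "qq": p_q_small ** 2,
}
print(genotype_probs)
```

Probabilities of this kind, computed for every marker genotype, play the role
of the weights $g_{ij}$ in the regression model. Sliding the putative QTL
position along the interval changes $r_1$ and $r_2$ and hence the weights.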

Significance  Testing  of  QTLs 

A serious problem with QTL mapping methods is how to decide when a likelihood
ratio is significant. In the usual hypothesis test in statistics, one
computes a test statistic whose distribution under the null hypothesis is
known. The null hypothesis is rejected if a value at least as extreme as the
one observed has a probability less than 5% under the null. This means one
expects to reject the null hypothesis 5% of the time, even if it is true.
With QTL mapping, one essentially tests the null hypothesis for thousands of
potential locations of the QTL, so on the face of it, false positives would
litter the genome. The problem is made even worse by the fact that we do not
know precisely what the null distribution of the likelihood ratio test is,
and the many tests that are done are not all independent of one another.

Fortunately  there  is  a way  out  of  this  morass.  Rather  than  relying  on 
asymptotic  theory  to  get  an  expected  null  distribution,  we  can  build  an 
empirical  null  distribution  for  the  likelihood  ratio  by  randomly  permuting 
(shuffling)  the  association  of  marker  genotypes  and  phenotypes  (Doerge  and 
Churchill  1996).  Just  as  the  Hudson  (1992)  test  of  geographic  structure  gave  a 
null  distribution  by  randomly  permuting  geographic  location,  this  approach 
uses  all  the  built-in  structure  of  the  data  (including  variations  of  sample  size, 
clustered  distribution  of  markers,  and  so  on)  to  generate  a null  distribution 
tailored  to  the  data.  When  the  observed  likelihood  ratios  are  tested  against 
this  empirical  null,  then  most  of  the  problems  outlined  above  disappear. 
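
The logic of the permutation approach can be sketched in a few lines. This is
a simplified stand-in rather than the procedure of Doerge and Churchill
(1996) itself: the test statistic here is the largest absolute difference in
mean phenotype between the two genotype classes at any marker, used in place
of a likelihood ratio, and the simulated data (60 individuals, 10 unlinked
markers coded 0 or 1, one marker with a real effect) are entirely
hypothetical.

```python
import math
import random


def max_marker_statistic(genotypes, phenotypes):
    """Largest |mean difference| between genotype classes over all markers."""
    best = 0.0
    for m in range(len(genotypes[0])):
        high = [y for g, y in zip(genotypes, phenotypes) if g[m] == 1]
        low = [y for g, y in zip(genotypes, phenotypes) if g[m] == 0]
        if high and low:
            stat = abs(sum(high) / len(high) - sum(low) / len(low))
            best = max(best, stat)
    return best


def permutation_threshold(genotypes, phenotypes, n_perm=200, alpha=0.05, rng=None):
    """Genome-wide critical value from shuffled phenotype labels."""
    rng = rng or random.Random()
    shuffled = list(phenotypes)
    null_maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # break the genotype-phenotype association
        null_maxima.append(max_marker_statistic(genotypes, shuffled))
    null_maxima.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_maxima[index]


rng = random.Random(7)
genotypes = [[rng.randint(0, 1) for _ in range(10)] for _ in range(60)]
# Hypothetical truth: only the marker at index 3 affects the phenotype.
phenotypes = [3.0 * g[3] + rng.gauss(0.0, 1.0) for g in genotypes]

observed = max_marker_statistic(genotypes, phenotypes)
threshold = permutation_threshold(genotypes, phenotypes, rng=rng)
print("observed maximum statistic:", observed)
print("permutation 5% threshold:", threshold)
```

Recording the maximum over all markers in each shuffled data set is what
makes the threshold genome-wide: it accounts for the number of tests and for
their correlation without any assumption about the form of the null
distribution.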

Composite  Interval  Mapping  and  Other  Refinements 

The method of interval mapping as described above examines only the pair of
markers flanking the putative QTL position in assessing a probability that a
particular marker genotype bears a particular pair of QTL alleles. It also
considers the phenotype to be determined in a strictly additive fashion, so
that the marginal effect of a particular QTL on the phenotype is all that is
needed to map the QTL. More recently, these assumptions have been relaxed in
the method of composite interval mapping (Zeng 1994; Jansen and Stam 1994).
This approach is essentially a multiple regression extension of standard
interval mapping allowing genotypes at marker loci more distant from the
putative QTL site to have an effect on the phenotype. This is done by a
statistical approach called partial regression, and in principle it allows
there to be some level of epistasis between sets of QTLs in the way they
affect a trait. Both simulations and practical application of these methods
are very encouraging.

One limitation of using F2 populations is that blocks of the genome will
remain intact. Composite interval mapping algorithms have been generalized to
include crossing designs in which the F2 are intercrossed to give an F3,
which are then intercrossed, and so forth. After a few generations like this,
the mixed population is then backcrossed to the two parental populations, and
phenotypes and marker genotypes are assessed. Such a design allows much finer
resolution because several rounds of genetic recombination have occurred.
Even finer mapping can be done by considering "historical" recombination in
even longer-running experiments (Xiong and Guo 1997). Other methods for
assessing relationships between complex phenotypes and genotypes include
nonparametric methods (Kruglyak and Lander 1995a). In addition, if candidate
genes have been identified, one can build a gene tree for each candidate
gene, and contrasts of phenotypes among clades of the gene tree may be able
to distinguish subtle differences in phenotype (Templeton et al. 1995).

PROBLEM 9.12 The third chromosome data of Long et al. (1995)
consisted of estimates of line means of sternopleural and abdominal
bristles of 84 lines that were also scored for the presence of roo
transposable elements at 29 positions along the chromosome. Below is a
subset of the data, where H means that site on the genome had the
same roo genotype as the high selected line and L means the line had
the same genotype as the low selected line. Determine whether each
interval has an effect by a simple t-test. The t statistic is

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\mu_1$ and $\mu_2$ are the means of the H and L groups, $s_1^2$ and
$s_2^2$ are the sample variances of the H and L groups, and $n_1$ and $n_2$
are the sample sizes of the H and L groups, respectively. There are
$n_1 + n_2 - 2$ degrees of freedom to the t-test, and the 5% significance
level for $t_{11}$ is 2.201.

ANSWER The mean of the lines with an H in interval 1 is $\mu_1 = 18.94$,
and the mean of the lines with an L in interval 1 is $\mu_2 = 16.34$.
Substituting into the t-statistic formula, we get
$t = (18.94 - 16.34)/\sqrt{0.279/7 + 0.520/6} = 7.33$. This test has
13 - 2 = 11 degrees of freedom, and the corresponding P value is less than
0.05. We conclude that even with this small subset of the data, there is a
QTL near this region that gives the H lines more bristles than the L lines.
The t statistics for the other two intervals are both 4.07, and are also
significant. With so little data, we cannot tell whether the significant
effects are all caused by the same QTL or by more than one QTL.
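
The arithmetic in this answer can be checked with a few lines of Python. The sketch below is only an illustration: the function name `two_sample_t` is an assumption of this example, and the inputs are the rounded summary values quoted in the answer, paired in the order in which they appear in the formula. With these rounded inputs the statistic comes out near 7.3, in line with the value reported above.

```python
import math


def two_sample_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic for the difference of two group means,
    using a separate sample variance for each group."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error


# Rounded summary values for interval 1, as quoted in the answer.
t = two_sample_t(18.94, 0.279, 7, 16.34, 0.520, 6)
df = 7 + 6 - 2
critical_5pct = 2.201  # 5% significance level for 11 df, from the problem

print(f"t = {t:.2f} with {df} df; significant at 5%: {abs(t) > critical_5pct}")
```

Because the test needs only the group means, variances, and sizes, the same function can be applied to each interval in turn.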


Although  the  method  of  scoring  F2  individuals  is  powerful,  it  is  not 
essential  to  perform  controlled  crosses  to  make  inferences  about  the  effects 
of  marker  genes  on  quantitative  characters.  This  is  fortunate  because  meth- 
ods for  identifying  marker  genes  relevant  to  polygenic  human  diseases  are 
thought  to  have  great  potential  in  identifying  underlying  genes  (Lander  and 
Schork 1994). The chances for success in identifying QTLs in humans are
much improved by selecting candidate genes that may have a functional
relation to the trait (Risch and Merikangas 1996). For example, Sing and
Davignon  (1985)  examined  a sample  of  humans  for  apolipoprotein  E (apoE) 
genotype  and  several  quantitative  traits  relating  to  serum  triglyceride,  cho- 
lesterol, and  low-density  lipoproteins.  There  are  three  common  alleles  of 
apoE,  yielding  six  genotypes,  the  mean  phenotypes  of  which  are  presented 
in  Table  9.8.  About  16%  of  the  total  genetic  variance  in  LDL-C  (an  important 
carrier  of  cholesterol)  is  explained  by  genotype  at  the  apoE  locus.  The  impor- 
tance of  this  kind  of  approach  is  underscored  by  the  observation  that  about 
half  of  the  variation  among  individuals  in  serum  cholesterol  is  associated 
with  polygenic  variation. 
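
A rough feel for how much of the spread in LDL-C lies among apoE genotypes can be had from the genotype counts and means in Table 9.8. The sketch below is an illustrative calculation only, not the analysis of Sing and Davignon: it treats the tabled genotype means as exact, weights them by the genotype counts, and divides the resulting among-genotype variance by the tabled population variance. The variable names are assumptions of this example. The result is a fraction of the total phenotypic variance (roughly 9% with these rounded values), which is a different quantity from the share of genetic variance quoted in the text.

```python
# Genotype counts and mean LDL-C values copied from Table 9.8.
counts = {"e4e4": 4, "e3e3": 63, "e2e2": 2, "e4e3": 21, "e3e2": 10, "e4e2": 2}
ldl_means = {"e4e4": 102.9, "e3e3": 104.2, "e2e2": 73.9,
             "e4e3": 112.8, "e3e2": 89.5, "e4e2": 109.3}
total_variance = 602.17  # population variance of LDL-C from Table 9.8

n = sum(counts.values())

# Count-weighted mean of the genotype means.
grand_mean = sum(counts[g] * ldl_means[g] for g in counts) / n

# Count-weighted variance of the genotype means around the grand mean.
among_genotypes = sum(
    counts[g] * (ldl_means[g] - grand_mean) ** 2 for g in counts
) / n

fraction = among_genotypes / total_variance
print(f"n = {n}, grand mean = {grand_mean:.2f}")
print(f"fraction of phenotypic variance among genotypes = {fraction:.3f}")
```

The weighted grand mean reproduces the tabled population mean of about 104, which is a useful check that the counts and means were entered correctly.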

What  Have  We  Learned  from  Mapping  QTLs? 

One  might  think  that  by  mapping  QTLs,  geneticists  would  immediately 
learn  where  the  genes  are  that  affect  a character  and  then  be  able  to  identify 
those  genes  and  quickly  isolate  them.  This  has  not  been  the  case.  Most  QTL 
mapping  projects  have  not  been  followed  to  this  point.  Nevertheless,  it  is 
useful  to  consider  what  has  been  learned  from  the  patterns  of  QTL  effects 
observed.  One  of  the  largest  QTL  mapping  studies  was  that  of  Stuber  et  al. 
(1992),  which  sought  to  determine  the  genetic  basis  for  heterosis  or  hybrid 
vigor in maize. The primary competing hypotheses were: (1) hybrid vigor
arises from heterozygosity at individually important genes, each of which
shows heterozygote advantage; (2) heterozygous genotypes at each individual
locus are intermediate, but both parentals have low performance because of
homozygosity for different recessive deleterious alleles. Stuber et al.
found that many regions of the genome had the property that QTL genotypes
QQ and qq were inferior to Qq, so it appeared at first as if there were a
locus-by-locus heterozygote advantage.

TABLE 9.8  SERUM LIPID LEVELS IN A SAMPLE OF 102 PEOPLE FROM OTTAWA,
ADJUSTED FOR AGE, SEX, HEIGHT, AND WEIGHT EFFECTS

                                             Genotype                   Probability Estimates of the population
                               e4e4   e3e3   e2e2   e4e3   e3e2   e4e2    (F ratio)          Mean      Variance

Count                             4     63      2     21     10      2
Relative frequency            0.039  0.618  0.020  0.206  0.098  0.020            -             -             -

Variable
Total cholesterol             180.3  173.8  136.0  183.5  161.4  178.1         0.09        174.16        732.48
HDL-C                          53.3   47.3   47.1   47.3   45.7  45.12         0.93         47.32        130.43
LDL-C                         102.9  104.2   73.9  112.8   89.5  109.3         0.08        104.00        602.17
VLDL-C                         24.0   22.3   15.0   23.3   26.2   23.5         0.83         22.83        127.56
Triglycerides                  65.5   74.4   70.6   70.4   79.3   73.5         0.96         73.60        918.01
LDL-Apo B                      86.3   83.9   55.8   86.8   78.4   60.5         0.13         83.03        375.72
VLDL-Apo B                      8.5   11.4   5.65   10.5    9.2   18.9         0.56         10.91         66.21
LDL-C/total cholesterol        0.56   0.60   0.55   0.61   0.55   0.61         0.26          0.59        0.0055
VLDL-C/total triglycerides     0.37   0.32   0.19   0.33   0.33   0.32         0.79          0.32        0.0200

Source: From Sing and Davignon 1985.


Subsequent  work  suggests  that  the  regions  identified  in  this  mapping  effort 
(which  included  phenotypic  measurements  on  nearly  100,000  plants  and 
molecular  assays  done  for  76  loci)  were  still  too  coarse;  the  finer  mapping 
suggests  that  the  QTLs  are  actually  blocks  of  genes  in  linkage  disequilibri- 
um, with  each  gene  tending  toward  a pattern  of  deleterious  recessive  effects. 

A similar  experiment  in  hybrid  rice  initially  produced  results  much  more 
in  line  with  recessive  deleterious  effects  being  responsible  for  the  hybrid 
vigor  (Xiao  et  al.  1995).  Rice  is  predominantly  a self-pollinator,  while  corn 
outcrosses, so one might have expected rice to have more effectively
eliminated recessive deleterious alleles from its genome. Subsequent
analysis by Zeng (pers. comm.) is finding that both the maize and the rice
data show evidence for large amounts of epistasis.

This chapter opened with studies of bristle number in Drosophila that began
in the 1950s. The early work suggested that bristle number fit the classical
quantitative genetic model fairly well: it had an abundance of additive
variance and responded well to artificial selection. QTL mapping was applied
to bristle number by scoring roo elements as the genetic markers in 93
recombinant isogenic lines derived from divergently selected parental lines
(Long et al. 1995). The roo transposable element is present in high copy
number, and many sites differed between the two parental lines. Scoring the
roo elements by in situ hybridization allowed mapping effects to within
4 cM; the results revealed two X-linked QTLs and five third-chromosome QTLs
that had a significant effect on bristle number. All seven of these
locations happen to coincide with genes known to have large effects on
bristle number, such as achaete-scute, hairy, and Delta.

As exemplified by the findings with regard to bristle number, where there
are many candidate genes identified for a particular trait, it is often more
efficient to start with the candidate genes and ask how much of the
population variability they explain. One other candidate gene for bristle
number is scabrous, and Lai et al. (1995) did one of the most extensive
analyses of molecular variation in the sca locus and its associations with
quantitative variation. The fact that a significant association between
molecular variation of sca and phenotypic differences in bristle number was
found is consistent with the idea that variation at this locus has an effect
on bristle number even when averaged over effects of the rest of the genome.
Although it is known that outright loss-of-function mutations in sca affect
bristle number, it is quite another matter to find that quantitative
variation in a natural population in the expression of sca causes
quantitative variation in bristle number. Another idea for testing the role
of candidate genes is a quantitative trait complementation test (Mackay and
Fry 1996). When lines selected for high and low bristle number are crossed
to wildtype control lines and to mutant lines having defects in known
bristle genes, large differences in bristle traits are sometimes found. The
differences were interpreted to mean either that the selected lines had
allelic differences at the candidate genes or that they had epistatic
interactions with the candidate genes.

The  scale  at  which  we  look  at  the  genome  often  colors  the  way  problems 
are  perceived.  The  Stuber  et  al.  results  on  hybrid  vigor  in  maize  appeared  to 
show  heterosis  of  QTLs  at  one  scale  but  recessiveness  at  a finer  scale.  Quanti- 
tative variation  in  a single  gene's  expression  can  be  caused  by  P-element 
insertional  mutations  all  over  the  genome  (Clark  et  al.  1995b).  On  the  other 
hand,  even  molecular  variation  in  the  immediate  proximity  of  a structural 
gene  can  have  complex  effects  on  the  gene's  expression.  The  activity  of  Adh 
varies  among  lines  with  the  same  electrophoretic  allele,  and  it  required 
extremely  careful  and  laborious  in  vitro  mutagenesis  and  transformation 
experiments  to  tease  apart  the  effects  of  variation  at  the  sites  (Stam  and  Lau- 
rie 1996).  The  conclusion  was  that  more  than  one  nucleotide  site  affects 
expression  in  a way  that  appears  to  be  epistatic. 

The idea of using interspecific hybrid crosses to understand the genetic
basis for differences between species was introduced in Chapter 8, and it
should be clear that the use of sets of molecular markers makes these
methods even more powerful. Many plant species originated through ancient
hybridization events, and by crossing extant species and following the fate
of molecular markers, the genomic composition of the hybrid descendants can
be followed. A pair of species of Helianthus (sunflower) were thought to
have hybridized to give rise to a third species, and all three species can
be found today. The two original species were crossed, backcrossed for two
generations, and then selfed for two generations before DNA was extracted
and examined at 197 RAPD marker loci (Rieseberg et al. 1995). Many parts of
the genome were invariant. More strikingly, the genomic composition of the
hybrids was remarkably similar to that of the extant sunflower species of
putative hybrid origin (Rieseberg et al. 1996). Sometimes the genetic basis
for interspecific differences is dominated by one or a few genes of major
effect, as was found by Doebley et al. (1995) for the difference in glume
architecture between maize and teosinte. Another application of
interspecific crosses showed that the genetic basis for interspecific
differences in genital arch shape in male Drosophila species maps to several
genes on more than one chromosome (Liu et al. 1996; Laurie et al. 1997).

When  one  considers  the  problem  of  complex  genetic  diseases  in  humans, 
it  becomes  clear  that  the  essence  of  the  problem  is  to  understand  how  genet- 
ic variation  in  a population  relates  to  phenotypic  variation  for  risk  of  dis- 
ease. The  problem  is  intrinsically  one  that  requires  an  understanding  of 
principles  of  population  genetics.  The  patterns  of  underlying  genetic  varia- 
tion, including  differences  among  populations  and  linkage  disequilibrium, 
are  primary  topics  of  population  genetics.  As  the  human  genome  project 
nears  completion,  the  rate  at  which  single  gene  disorders  are  discovered  and 
characterized  will  also  near  exhaustion.  It  is  becoming  evident  that  the  next 
big  problem  in  medical  genetics  is  to  understand  complex  disorders.  It  seems 
inescapable  that  the  field  of  population  genetics  is  about  to  see  a dramatic 
increase  in  visibility  and  importance.  We  hope  this  book  played  a part  in 
inspiring  some  people  to  begin  thinking  about  these  problems. 


SUMMARY 

Multifactorial  traits  are  affected  by  multiple  genes  and  usually  by  environ- 
mental factors  as  well.  Some  multifactorial  traits,  such  as  height  or  weight, 
are  continuous  in  that  they  demonstrate  a continuum  of  possible  phenotyp- 
ic values.  Other  multifactorial  traits,  known  as  meristic  traits,  have  their  phe- 
notypic value  determined  by  enumeration,  for  example,  by  counting  the 
number  of  bristles  on  a fruit  fly  sternite.  Still  other  multifactorial  traits  fea- 
ture an  underlying  continuum  of  liability  or  risk,  and  only  certain  individu- 
als with  liabilities  above  a threshold  are  affected.  These  three  basic  types  of 
multifactorial  traits  are  collectively  called  quantitative  traits. 

Many quantitative traits have a distribution of phenotypic values that is
approximately normal and described completely in terms of two parameters,
the mean $\mu$ and the variance $\sigma^2$. Traits that are not normally
distributed sometimes become normal when measured on an appropriate scale,
such as $\ln(x)$ or $\arcsin(\sqrt{x})$, where x represents the original
measurement of phenotype. Much of the theory of quantitative genetics is
based on the assumption of a normal distribution of phenotypes.

Truncation selection is a method of individual selection in which all
individuals whose phenotype lies above a certain value T (the truncation
point) are saved and mated randomly among themselves to produce the next
generation. If $\mu$ denotes the mean phenotype of the original population
and $\mu_S$ denotes the mean phenotype among selected parents, then the mean
phenotype among the progeny ($\mu'$) is given by
$\mu' - \mu = h^2(\mu_S - \mu)$, where $h^2$ is the heritability of the
trait. The quantity $\mu' - \mu$ is usually called the response to selection
R, while $\mu_S - \mu$ is called the selection differential S; so the
prediction equation for individual selection can be written as $R = h^2 S$.
The prediction equation can be written equivalently as $R = i\sigma h^2$,
where i is called the intensity of selection and $i = S/\sigma$. Intensity
of selection is a useful quantity with which to compare diverse breeding
programs because i depends only on the proportion of the population saved
for breeding. When $h^2$ is estimated from observed values of R and S, then
$h^2$ is called the realized heritability. Otherwise $h^2$ can be estimated
from the resemblance between relatives, as exemplified by the regression
coefficient of offspring on parent. If b denotes the regression coefficient
of offspring on a single parent, then $h^2 = 2b$. If b denotes the
regression coefficient of offspring on the mean of the parents (midparent),
then $h^2 = b$.
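
The relations in this paragraph are simple enough to express as short functions. The sketch below is illustrative only: the function names and the numerical values are hypothetical, chosen for this example. The helper `intensity_from_proportion` relies on the standard normal-theory result that the mean of the top fraction B of a standard normal distribution equals the ordinate at the truncation point divided by B.

```python
from statistics import NormalDist


def response_to_selection(h2, S):
    """Prediction equation R = h^2 * S."""
    return h2 * S


def realized_heritability(R, S):
    """Heritability estimated from an observed response and differential."""
    return R / S


def heritability_from_regression(b, midparent=False):
    """h^2 = b for offspring on midparent, 2b for offspring on one parent."""
    return b if midparent else 2 * b


def intensity_from_proportion(B):
    """Intensity of selection when a proportion B is saved (normal trait)."""
    std_normal = NormalDist()
    truncation_point = std_normal.inv_cdf(1 - B)
    return std_normal.pdf(truncation_point) / B


# Hypothetical values for illustration only.
mu, mu_s, sigma, h2 = 100.0, 110.0, 8.0, 0.30
S = mu_s - mu                      # selection differential
R = response_to_selection(h2, S)   # response to selection
i = S / sigma                      # intensity of selection

print(f"S = {S}, R = {R:.2f}, progeny mean = {mu + R:.2f}")
print(f"i = {i:.2f}, i * sigma * h2 = {i * sigma * h2:.2f}")
print(f"i when the top 20% are saved = {intensity_from_proportion(0.2):.2f}")
```

The two forms of the prediction equation give the same response, and the last line shows why i is convenient for comparing breeding programs: it is computed from the proportion saved alone, without reference to the units of the trait.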

In terms of genetic parameters,
$h^2 = \sum 2pq[a + (q - p)d]^2 / \sigma^2$, where summation is carried out
across all genes affecting the trait and, for each gene, p is the frequency
of the favorable allele, a is the effect of the gene (measured as the
average difference between the homozygotes), and d is a measure of dominance
(measured as the deviation of the heterozygote from the mean of the
homozygotes). The quantity $\sum 2pq[a + (q - p)d]^2$ is known as the
additive genetic variance. In the above formula for $h^2$, the symbol
$\sigma^2$ represents the variance in phenotypic value in the population.
Thus, the heritability is the ratio of additive genetic variance to total
phenotypic variance.
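
As a worked illustration of this formula, the sketch below sums the additive genetic variance over two loci and divides by the phenotypic variance. The allele frequencies, the values of a and d, and the phenotypic variance are all hypothetical numbers chosen for this example, and the function name is likewise an assumption.

```python
def additive_genetic_variance(loci):
    """Sum of 2pq[a + (q - p)d]^2 over loci given as (p, a, d) tuples."""
    total = 0.0
    for p, a, d in loci:
        q = 1.0 - p
        total += 2 * p * q * (a + (q - p) * d) ** 2
    return total


# Hypothetical loci: (frequency of favorable allele p, effect a, dominance d).
loci = [
    (0.25, 1.0, 0.5),  # partially dominant locus
    (0.80, 0.5, 0.0),  # strictly additive locus
]
phenotypic_variance = 2.0  # hypothetical sigma^2

V_A = additive_genetic_variance(loci)
h2 = V_A / phenotypic_variance
print(f"additive genetic variance = {V_A:.4f}")
print(f"narrow-sense heritability = {h2:.4f}")
```

Changing p for the first locus changes its contribution through both the 2pq factor and the (q - p)d term, which is one way to see why heritability is specific to the allele frequencies of a particular population.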

Resemblance between relatives can also be used to partition the total
phenotypic variance of a trait into components due to genotype and
environment. The total genotypic variance can be further partitioned. In the
absence of genotype-environment interaction (evidenced by parallel norms of
reaction) and in the absence of genotype-environment association (evidenced
by lack of correlation between genotype and environment), the total
phenotypic variance can be written as the sum of a variance term due to the
additive effects of genes (the additive genetic variance), a term due to
dominance effects (the dominance variance), a term due to interactions
between genes (the epistatic variance), and so forth. The ratio of the
additive genetic variance to the total phenotypic variance is the
heritability $h^2$, more precisely called the narrow-sense heritability.
This is the heritability that is important in individual selection. Another
type of heritability is the ratio of the total genetic variance to the total
phenotypic variance; this is called heritability in the broad sense.
Heritability in the broad sense is important when selection is practiced
between clones, inbred lines, or noninterbreeding varieties. Both types of
heritability are specific to a particular trait in a particular population
at a particular time, because they depend on the allele frequencies in the
population, the environmental variance, and the additive, dominance, and
other effects. Furthermore, heritabilities estimate the fraction of the
variance in a trait that is due to genetic causes, either in the narrow
sense of additive effects or in the broad sense of all genetic effects. The
heritability is not informative about the mean phenotypic value of a trait.
For example, the mean can be altered dramatically by a change in environment
without affecting the heritability.

After  long-continued  selection  for  a quantitative  trait,  response  to  selec- 
tion eventually  ceases  and  the  population  reaches  a plateau.  A selection  limit 
may  be  reached  even  before  all  the  additive  genetic  variance  for  the  trait  has 
been  exhausted,  and  in  many  cases,  the  response  to  selection  ceases  because 
natural  selection  for  fitness-related  traits  counterbalances  the  artificial  selec- 
tion for  the  trait  in  question.  Nevertheless,  artificial  selection  can  produce  a 
total  response  of  five  or  ten  or  more  phenotypic  standard  deviations,  so  the 
trait  in  the  selected  population  may  be  well  beyond  the  range  of  what  it  was 
in  the  original  population.  This  seemingly  paradoxical  result  happens 
because  populations  used  in  selective  breeding  are  typically  small  enough  in 
number  that  not  every  possible  genotype  can  be  represented,  particularly 
genotypes  that  are  rare. 

Threshold  traits  are  traits  that  are  either  present  or  absent  in  an  individ- 
ual, but  their  presence  or  absence  is  determined  by  the  value  of  an  underly- 
ing quantitative  trait  known  as  the  liability.  The  heritability  of  liability  can  be 
calculated  from  established  principles  of  quantitative  genetics  when  one 
knows  the  incidence  of  the  trait  in  the  general  population  and  also  the  inci- 
dence among  offspring  or  other  relatives  of  affected  individuals. 

The  principles  of  quantitative  genetics  also  have  application  in  under- 
standing the  evolution  of  quantitative  traits  under  natural  selection.  The 
models  include  many  parameters  and  their  implications  can  be  investigated 
by  computer  simulation  or  mathematical  analysis  under  simplifying 
assumptions.  At  opposite  ends  of  the  spectrum  are  models  that  assume  a 
large  number  of  genes,  each  with  small  effects,  and  those  that  assume  a small 
number  of  genes,  each  with  a relatively  large  effect.  There  are  methods  for 
estimating  the  minimum  number  of  genes  affecting  a quantitative  trait,  such 
as  Wright's  method,  but  the  methods  are  biased  toward  underestimates. 
Another  approach  is  the  use  of  molecular  genetic  markers  to  detect  linkage 
with  loci  that  affect  the  quantitative  trait  of  interest,  which  are  called  QTLs  or 
quantitative  trait  loci.  Various  methods  for  detecting  QTLs  have  been 
devised, but they all have difficulty distinguishing whether an identified
QTL is due to a single allele of large effect or a group of linked alleles,
each of small effect. These possibilities can be sorted out by the study of
multigeneration pedigrees in which recombination has the opportunity to
break up blocks of linked alleles.


PROBLEMS 

1.  A recent  study  reports  that  the  heritability  of  violent  behavior  in  humans 
is  80%.  Many  people  think  that  this  means  that  people  with  the  at-risk 
alleles  (if  they  could  be  identified)  are  destined  to  become  violent.  What 
is  the  fallacy  in  this  argument? 

2.  The  following  are  the  weaning  weights  (lbs)  of  lambs  in  a large  flock. 
Estimate  the  mean,  variance  and  standard  deviation  of  weight  at 
weaning. 


68  79  93  67  73  81  82  81  85  78
72  69  64  82  77  59  68  54  71  57
88  97  69  60  92  62  64  64  90  60

3.  Suppose  a population  of  Drosophila  has  a normal  distribution  of  abdomi- 
nal bristles  with  a mean  of  20  and  a standard  deviation  of  2.  What  pro- 
portion of  the  population  is  expected  to  fall  into  the  following  categories 
of  bristle  number: 

a.  Between  18  and  22. 

b.  Greater  than  22. 

c.  Greater  than  24. 

d.  Between  20  and  22. 

e.  Smaller  than  16. 

4.  Two inbred varieties of tobacco are crossed and give a variance in leaf
number in the F1 generation of 1.5. The variance in the F2 generation is
6.0. What are the genotypic and environmental variance components and
the broad sense heritability?

5.  In a population of the flour beetle Tribolium the mean weight of pupae is
2000 mg. The phenotypic variance is 40,000 mg², and the additive genetic
variance is 10,000 mg². If individuals with a mean pupa weight two
phenotypic standard deviations above the mean are selected, what is the
expected average pupa weight among the progeny?

6.  If  a population  of  Drosophila  has  a mean  number  of  abdominal  bristles  of 
20  with  a narrow  sense  heritability  of  25%,  what  is  the  expected  bristle 
number  after  one  generation  when  the  selection  differential  is  four  bris- 
tles? What  is  the  expected  number  after  10  generations  of  equally  intense 
selection? 

7.  Five generations of selection for decreased plasma cholesterol level in
mice decreased the mean from 2.16 mg/100 ml to 2.01 mg/100 ml. The
average selection differential was 0.07 mg/100 ml. What is the realized
heritability?

8.  A quantitative trait has the following mean values for the genotypes shown:

AA      AA'     A'A'

23.8    25.2    19.4

a.  What  are  the  values  of  a and  d? 

b.  What  allele  frequency  of  A would  maximize  the  mean  value  of  the 
trait  in  the  entire  population? 

9.  Two strains of rats were selected for increased or decreased pigmentation
on the head and back. After 10 generations the high strain had a rating of
3.73 and the low strain a rating of -2.01. The strains were crossed, and the
standard deviations in the F1 and F2 generations were 0.87 and 0.60,
respectively. Estimate the effective number of factors affecting the trait in
these strains.

10.  Estimate  the  correlation  coefficient  in  litter  size  between  first  and  second 
litters  using  the  following  data  from  10  females. 

First litter     8   9   9  10  10  10  11  11  13  13

Second litter    6   8  12  10  12  12   9  10  12  12

11.  How  many  generations  of  selection  with  a selection  differential  of  20 
would  be  required  to  increase  the  average  number  of  eggs  laid  per  hen 
per  year  from  180  to  220,  given  a heritability  of  20%? 

12.  A standard normal distribution is a normal distribution with mean 0 and
variance 1. Using this distribution, calculate the mean of the top 1/2, 1/4,
1/8, 1/16, and 1/32 of the population. These values are the standardized
selection intensities (i) for the proportions saved, which correspond to
truncation points of 0.00, 0.68, 1.16, 1.54, and 1.86, respectively.

13.  If  the  selection  differential  differs  in  males  and  females,  show  that  the 
proper  value  to  use  in  Equation  9.10  is  the  mean. 

14.  Consider a locus with genotypes AA, AA', A'A' whose contribution to a
quantitative  trait  has  a = 0.6  and  d = 0.2;  another  locus  with  genotypes  BB, 
BB',  B'B'  contributes  to  the  same  trait  with  a = 0.4  and  d = 0.  If  the  loci  are 
unlinked  and  additive  and  the  allele  frequencies  of  A and  B are  0.5  and 
0.7,  respectively,  calculate  the  narrow-sense  and  broad-sense  heritability 
of  the  trait  when  the  total  phenotypic  variance  is  1.0. 

15.  A herd  of  dairy  cattle  yields  milk  with  a fat  content  of  3.4%  ± 0.65% 
(mean  ± standard  deviation)  and  a protein  content  of  3.3%  ± 0.45%.  The 
heritabilities  of  these  traits  are  60  and  70%,  respectively,  and  the  genetic 
correlation  is  0.55.  If  selection  is  practiced  for  percent  protein  with  a selec- 
tion intensity  of  i = 1.5,  what  increase  in  percent  protein  and  percent  fat 
would  be  expected?  What  intensity  of  selection  would  produce  the  same 
increase  in  percent  fat  by  direct  selection? 

16.  Show that a simple Mendelian dominant gene has a narrow sense
heritability of $2(1 - q)/(2 - q)$, where q is the frequency of the dominant
allele.

17.  For  an  overdominant  locus  with  two  alleles,  show  that  the  additive  genet- 
ic variance  at  equilibrium  equals  0. 

18.  The intensity of selection i is the mean of the selected parents in a
standard normal distribution when B is the proportion saved. It equals the
selection differential in units of standard deviation, so
$i = (\mu_S - \mu)/\sigma$. Over the range B = 0.05 to B = 0.005, i is given
approximately by $i = 0.8 + 0.41 \ln[(1/B) - 1]$ (Simmonds 1977). Calculate
i for B = 1/2, 1/4, 1/8, 1/16, and 1/32 and compare with the corresponding
values in Problem 12.

19.  If a normally distributed phenotype X, with truncation point T, is
transformed to a standard normal $x = (X - \mu)/\sigma$, with transformed
truncation point $t = (T - \mu)/\sigma$, what is the relationship between
the ordinate Z of the untransformed distribution at T and the ordinate z of
the standardized distribution at the point t?

CHAPTER 1

1.  Some  of  the  possible  causes  of  variation  include:  (a)  variation  in  size  of 
the  flies,  (b)  variation  in  primary  sequence  of  G6PD,  giving  allelic  forms 
with  different  specific  activities,  (c)  variation  in  transcription  rates,  due 
to  either  cis  or  trans  factors,  (d)  variation  in  post-transcriptional  process- 
ing of  the  message  to  produce  functional  mRNA,  (e)  variation  in  stability 
of  the  mRNA,  (f)  variation  in  rates  of  translation,  (g)  variation  in  post- 
translational  processing  of  the  protein,  (h)  variation  in  protein  stability. 

2.  One  might  expect  that  there  would  be  defects  at  all  the  steps  giving  rise 
to  PKU,  but  in  fact  the  overwhelming  majority  of  cases  are  caused  by 
defects  in  the  primary  gene  sequence.  Not  all  are  amino  acid  changes  or 
nonsense  mutations,  however.  There  are  many  defects  at  splice  junctions, 
resulting  in  improper  splicing  of  the  mRNA.  Mutations  in  trans-acting 
factors  that  cause  a complete  loss  of  enzyme  activity  have  not  been 
found,  presumably  because  they  would  cause  misexpression  of  other 
genes as well, so that defects in trans-factors are not identified as PKU.

3.  Of  the  576  possible  single-base  mutations,  438  yield  a codon  that 
encodes  something  other  than  the  premutation  codon. 

4.  Many  amino  acid  replacements  can  be  made  by  a single  nucleotide 
change,  such  as  phenylalanine  (UUU)  to  leucine  (CUU).  Other  replace- 
ments require  two  or  even  three  nucleotide  changes.  For  example, 
methionine  (AUG)  codons  must  be  changed  at  all  three  sites  to  encode 
cysteine (UGU or UGC). The rate of mutation from one amino acid to
another  depends  on  the  number  of  underlying  nucleotide  changes  that 
must  occur. 
5. Pleiotropic effects of mutations result in many examples of apparently
disparate  phenotypic  effects  of  mutations.  Model  systems,  like 
Drosophila,  provide  some  of  the  best  examples.  For  example,  the  gene 
maleless  encodes  an  RNA  helicase  and  is  important  for  normal  dosage 
compensation,  increasing  the  rate  of  X-chromosome  transcription  in 
males  only.  Mutations  in  maleless  result  in  male  death.  Other  mutations 
in  maleless  result  in  a lower  temperature  at  which  flies  are  incapacitated 
(previously thought to be caused by a different gene, no-action-potential, now
known  to  be  the  same  as  maleless).  Examples  like  this  are  common  in  the 
literature;  especially  common  are  the  disparate  pleiotropic  effects  seen  in 
human  genetic  disorders. 

6. The males produce gametes that are (e st) and (e+ st+) in proportions 1/2 :
1/2 because they do not undergo meiotic recombination. Females produce
gametes with respective frequencies: (e st) 0.315, (e+ st+) 0.315, (e st+)
0.185, and (e+ st) 0.185. Draw a 2 × 4 Punnett square to get the frequencies
of the eight genotypes of the next generation by multiplying the respective
gamete frequencies of males and females. For example, the genotypes
(e st)/(e st) and (e+ st+)/(e+ st+) will each have frequency 1/2 × 0.315 =
0.158, and (e st)/(e+ st+) will have frequency 0.315. All four other
genotypes have frequency 1/2 × 0.185 = 0.092.
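
The Punnett-square bookkeeping in Answer 6 can be sketched in a few lines of Python. This is only an illustration: the dictionary labels are ad hoc names, and the two reciprocal cells that give (e st)/(e+ st+) are pooled into one genotype, which is why that class reaches 0.315.

```python
from itertools import product

# Gamete frequencies quoted in Answer 6 (males do not recombine).
male = {"e st": 0.5, "e+ st+": 0.5}
female = {"e st": 0.315, "e+ st+": 0.315, "e st+": 0.185, "e+ st": 0.185}

# Each cell of the 2 x 4 square is a male frequency times a female frequency.
offspring = {}
for (m, pm), (f, pf) in product(male.items(), female.items()):
    genotype = " / ".join(sorted((m, f)))  # unordered pair: reciprocal cells merge
    offspring[genotype] = offspring.get(genotype, 0.0) + pm * pf

for genotype, freq in sorted(offspring.items()):
    print(f"{genotype:18s} {freq:.4f}")
```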

7. Assuming that the chance of having a son or daughter is each 1/2, such a
stopping rule would have no effect on the sex ratio, which would
remain at 1/2 : 1/2. To show this result, make a table with the counts of
sons and daughters of all possible families up to five daughters and one
son (and five sons and one daughter). Adding up the fraction of families
of each type multiplied by the counts of the two sexes will show
that the sum converges to 1/2.

8. The chance of producing one gamete that is ABCD is 1/2 × 1/2 × 1 × 1/2 =
1/8. The chance of producing two such gametes is 1/8 × 1/8 = 1/64, again
applying the multiplication rule because the two gametes are independent
of one another.

9.  Many  such  problems  in  probability  arise  in  calculating  genetic  risk  in 
pedigrees.  In  this  case,  both  parents  have  to  be  heterozygotes  in  order  to 
have  produced  an  affected  offspring.  The  fraction  of  offspring  of  a cross 
Aa × Aa that is heterozygous is 1/2, so this is the chance that the sib is a
carrier. 

10. The binomial variance is $pq/N$. For $p = q = 1/2$, this is $1/(4N)$. We would
reject the null hypothesis of $p = q = 1/2$ if the observed frequency were
more deviant than two standard deviations. Solving $2\sqrt{1/(4N)} = 0.05$,
we get $N = 400$. This means if we counted only 200 offspring, a sex ratio
of 55 : 45 would not be significantly different from 50 : 50.
11. One could calculate the frequency of the A morph in each population
and the standard error of the frequency estimates. If the confidence
intervals of these frequency estimates overlap sufficiently, then the
apparent differences in frequency could be accounted for by sampling
effects. More simply, one can do a heterogeneity $\chi^2$ test. In this case, this
test is done as a standard 2 × 2 contingency table. Let the first row be
a = 26, b = 28 and the second row be c = 10 and d = 21; the $\chi^2$ value is
$(ad - bc)^2 N/[(a + b)(c + d)(a + c)(b + d)] = 2.04$, where $N = a + b + c + d$.
This is less than the 3.84 critical value for one degree of freedom, and we
conclude that it is possible these samples are from the same population.
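
As a check on the arithmetic in Answer 11, the shortcut formula for a 2 × 2 table can be evaluated directly. The function name below is a hypothetical helper, not something from the text, and no continuity correction is applied (the answer does not use one).

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(26, 28, 10, 21)
CRITICAL_1DF = 3.84  # 5 percent critical value for one degree of freedom
print(f"chi-square = {chi2:.2f}; exceeds critical value: {chi2 > CRITICAL_1DF}")
```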

12. (a) Allele frequency of PGI-2a = (70 + 19)/114 = 0.78, of PGI-2b = 0.22. (b)
Expected genotype frequencies are $(0.78)^2 = 0.61$, $2(0.78)(0.22) = 0.34$, and
$(0.22)^2 = 0.05$, giving expected numbers of 34.8, 19.4, and 2.8.

13.  On  a log  scale,  the  plot  of  the  human  population  size  versus  time  curves 
upward  dramatically.  This  means  that  our  population  is  growing  consid- 
erably faster  than  at  an  exponential  growth  rate.  Think  about  this  as  you 
work  Problem  14. 

14. If one generation takes 12 days, a year is approximately 30 generations.
Each generation the population increases 250-fold, so $250^{30} = 8.67 \times 10^{71}$ is
the number of flies. One kilogram is $10^6$ milligrams, so the weight will be
$8.67 \times 10^{65}$ kilograms. The mass of the earth is $5.97 \times 10^{24}$ kg, so
apparently flies are not able to keep on breeding at their maximum rate.
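
The size of these numbers is easy to confirm with exact integer arithmetic. The sketch below assumes, as the answer's conversion implies, a mass of about 1 mg per fly; the variable names are illustrative.

```python
GENERATIONS_PER_YEAR = 30
FOLD_INCREASE = 250
MG_PER_KG = 10**6
EARTH_MASS_KG = 5.97e24

flies = FOLD_INCREASE ** GENERATIONS_PER_YEAR  # exact integer, about 8.67e71
mass_kg = flies / MG_PER_KG                    # assumes roughly 1 mg per fly
earth_masses = mass_kg / EARTH_MASS_KG

print(f"flies after one year: {flies:.3e}")
print(f"total mass: {mass_kg:.3e} kg, or {earth_masses:.2e} earth masses")
```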

CHAPTER  2 

1. 80 mm is the mean plus one standard deviation. We know that 68% of
the  distribution  lies  within  one  standard  deviation.  This  means  that  34% 
are  between  the  mean  and  plus  one  standard  deviation.  Since  50%  of 
the  population  falls  below  the  mean  (the  normal  distribution  is  symmet- 
ric), this  implies  that  50%  + 34%  = 84%  of  the  population  is  smaller  than 
80  mm. 

2.  90  mm  is  two  standard  deviations  above  the  mean.  We  know  that  95%  of 
the distribution lies within two standard deviations of the mean, so
47.5%  falls  between  the  mean  and  plus  two  standard  deviations  (between 
70  and  90  mm  in  our  example).  Because  34%  fall  between  70  and  80  mm, 
this  means  that  the  difference,  or  13.5%,  falls  between  80  and  90  mm. 

3. The sum of the counts is 193, and there are 14 counts, so the sample mean
is 193/14 = 13.785. The sum of squared counts is 2679, so the sample
variance is $[2679 - (193^2/14)]/13 = 1.412$. The sample standard deviation is
the square root of the sample variance, or 1.188. The standard error of the
mean is the standard deviation divided by the square root of the sample
size, or 0.318.
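
The same quantities can be recomputed from the three totals quoted in Answer 3 (the individual counts are not reproduced here, so only the sums are used). The variable names are illustrative.

```python
import math

n = 14            # number of counts
total = 193       # sum of the counts
total_sq = 2679   # sum of the squared counts

mean = total / n
variance = (total_sq - total**2 / n) / (n - 1)  # sample variance, n - 1 divisor
std_dev = math.sqrt(variance)
std_err = std_dev / math.sqrt(n)

print(f"mean={mean:.3f} variance={variance:.3f} sd={std_dev:.3f} se={std_err:.3f}")
```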
4. The standard error is $\sqrt{64/100} = 0.8$. The difference in means of the
large sample and the smaller test sample is 60 - 58 = 2. This difference is
2/0.8 = 2.5 standard errors. Because we expect 95% of random samples
from the same population to have a mean within two standard errors
and the observed mean is more deviant than this, we conclude that the
sample population is significantly smaller in size than the rest of the
species. The actual probability of getting a mean this small (or smaller)
by chance is 0.0062.
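
The tail probability quoted in Answer 4 follows from the standard normal distribution. A minimal sketch, using only the standard library (the helper `normal_cdf` is an illustrative name):

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

variance, sample_size = 64, 100
std_err = math.sqrt(variance / sample_size)  # 0.8
z = (58 - 60) / std_err                      # test-sample mean minus large-sample mean
p_lower = normal_cdf(z)                      # chance of a mean this small or smaller

print(f"z = {z:.2f}, one-tailed P = {p_lower:.4f}")
```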

5.  This  is  an  example  of  the  central  limit  theorem  in  action.  Adding  12  uni- 
form random  numbers  gives  a resulting  sum  whose  distribution  is 
approximately  normal. 

6.  Positive  covariance  would  mean  that  successive  numbers  are  more  simi- 
lar than  they  would  be  had  the  numbers  been  independent.  This  would 
result  in  a smaller  variance  than  expected  under  the  central  limit  theo- 
rem. 

7.  This  is  most  likely  an  X-linked  trait,  a hypothesis  that  can  be  easily  test- 
ed if  females  and  sets  of  their  offspring  can  be  examined.  The  males 
seem  to  have  80%  F and  20%  S alleles,  and,  if  you  count  up  the  number 
of  F and  S alleles  in  the  females,  they  too  have  80%  F alleles  and  20%  S 
alleles. 

8.  Heterozygotes  would  produce  both  F and  S polypeptide  chains,  which 
could  be  assembled  into  dimers  that  are  FF,  FS,  and  SS.  These  three  mol- 
ecules would  most  likely  have  different  electrophoretic  mobility,  giving 
three  bands  on  a protein  gel.  In  the  case  of  tetramers,  the  heterozygote 
could  produce  FFFF,  FFFS,  FFSS,  FSSS,  and  SSSS.  These  may  or  may  not 
all  be  resolvable  on  a protein  gel,  but  in  the  best  case,  all  five  bands 
would  be  visible. 

9. The polymerase chain reaction doubles the amount of a specific fragment
of DNA each round (assuming perfect efficiency). This means that
30 rounds of PCR will yield $2^{30} = 1.074 \times 10^9$ copies. In reality PCR is
rarely more than 80% efficient.

10.  If  one  starts  with  a large  number  of  template  molecules  of  DNA,  the  PCR 
reaction  begins  by  amplifying  many  of  them.  The  first  round  of  PCR  will 
have  errors,  but  there  will  be  many  different  errors,  and  all  will  be  rare. 
Subsequent  rounds  of  PCR  will  have  additional  errors,  but  at  the  end  of 
many  rounds  of  PCR,  each  individual  error  will  remain  rare.  Sequencing 
this  mix  will  give  the  most  common  apparent  base  at  each  position,  and 
this  should  be  the  true  base.  One  can  have  problems  sequencing  after  PCR 
reactions  if  the  initial  concentration  is  very  low  (only  a few  molecules  of 
template  DNA),  or  if  one  clones  a PCR  product.  Generally,  if  PCR  products 
are  cloned,  careful  investigators  insist  on  sequencing  more  than  one  to  be 
sure  PCR  artifacts  have  not  been  introduced. 
11. Levels of polymorphism at the nucleotide level are often low. In
humans, for example, two sequences will differ at an average of around
0.1% of the bases, or around 1 per 1000. Any method of calling variation
that has a false positive rate of even one percent could lead to serious
misinterpretation of amounts of variation in the population.

12. The average of the pairwise differences is 5.0, so the average
heterozygosity per site is 5/1200 = 0.00417. This per-site heterozygosity is the
same as the nucleotide diversity, $\pi$.

13.  If  the  DNA  types  do  not  match,  one  can  be  certain  that  the  sample  from 
the  crime  scene  did  not  come  from  the  suspect  (barring  lab  errors).  This 
situation  is  known  as  exclusion.  Such  information  has  resulted  in  the 
release  of  many  wrongly  accused  people.  On  the  other  hand,  if  the  DNA 
types  do  match,  then  either  the  suspect  is  the  source  of  the  blood  from 
the  crime  scene,  or  someone  else  with  the  same  genotype  at  the  scored 
loci  was  the  source  of  the  blood.  The  chance  that  the  suspect  is  the  one 
now  depends  on  how  often  one  gets  such  a match  by  chance.  Thus  the 
whole  problem  of  matching  hinges  on  population  genetic  issues. 

CHAPTER  3 

1. Allele frequency $q = (1/10{,}000)^{1/2} = 0.01$, and $2pq = 2(0.99)(0.01) = 0.0198$
is the frequency of carriers (i.e., about one person in 50).

2. Allele frequency of d = $(170/400)^{1/2} = 0.65$, of D = 0.35. Expected
genotype frequencies of DD, Dd, and dd are 0.12, 0.46, and 0.42. Fraction of
heterozygotes among Rh+ is 0.46/(0.12 + 0.46) = 0.79, and expected
number of heterozygotes is 0.79 × 230 = 182.
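
The steps in Answer 2 (square root of the recessive homozygote frequency, then the conditional fraction of heterozygotes among dominant phenotypes) can be sketched as follows. The counts come from the answer; the variable names are illustrative, and Hardy-Weinberg proportions are assumed throughout.

```python
import math

n_total = 400      # people sampled
n_recessive = 170  # Rh-negative (dd) individuals

q = math.sqrt(n_recessive / n_total)  # frequency of d under Hardy-Weinberg
p = 1.0 - q                           # frequency of D

# Fraction of Rh+ individuals (DD or Dd) that are heterozygous.
het_fraction = 2 * p * q / (p**2 + 2 * p * q)
expected_hets = het_fraction * (n_total - n_recessive)

print(f"q = {q:.2f}, heterozygous fraction = {het_fraction:.2f}, "
      f"expected heterozygotes = {expected_hets:.0f}")
```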

3. Allele frequency of M = (2202 + 1496)/6200 = 0.60, of N = 0.40. Expected
numbers are $(0.60)^2(3100) = 1116$, $2(0.60)(0.40)(3100) = 1488$, and
$(0.40)^2(3100) = 496$. The $\chi^2 = (1101 - 1116)^2/1116 + (1496 - 1488)^2/1488 +
(503 - 496)^2/496 = 0.34$ with one degree of freedom (because there are
three classes of data and one parameter estimated from the data), with
an associated probability value of about 0.65.

4. Expand $(0.1A_1 + 0.2A_2 + 0.3A_3 + 0.4A_4)^2$ to obtain
$A_1A_1$: $(0.1)^2 = 0.01$;
$A_1A_2$: $2(0.1)(0.2) = 0.04$;
$A_1A_3$: $2(0.1)(0.3) = 0.06$;
$A_1A_4$: $2(0.1)(0.4) = 0.08$;
$A_2A_2$: $(0.2)^2 = 0.04$;
$A_2A_3$: $2(0.2)(0.3) = 0.12$;
$A_2A_4$: $2(0.2)(0.4) = 0.16$;
$A_3A_3$: $(0.3)^2 = 0.09$;
$A_3A_4$: $2(0.3)(0.4) = 0.24$;
$A_4A_4$: $(0.4)^2 = 0.16$.
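
The multinomial expansion in Answer 4 generalizes to any number of alleles: homozygotes get the squared allele frequency and heterozygotes get twice the product. A short sketch, with illustrative names:

```python
from itertools import combinations_with_replacement

allele_freqs = {"A1": 0.1, "A2": 0.2, "A3": 0.3, "A4": 0.4}

genotype_freqs = {}
for a, b in combinations_with_replacement(allele_freqs, 2):
    if a == b:
        genotype_freqs[a + b] = allele_freqs[a] ** 2                   # homozygote
    else:
        genotype_freqs[a + b] = 2 * allele_freqs[a] * allele_freqs[b]  # heterozygote

for genotype, freq in genotype_freqs.items():
    print(f"{genotype}: {freq:.2f}")
```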

5. If heterozygous parental genotype is Aa, half the gametes contain A and
half contain a. The A gamete yields a heterozygous offspring when it
unites with an a gamete (probability q) and the a gamete yields a
heterozygous offspring when it unites with an A gamete (probability p). Overall
probability of heterozygous offspring is $(1/2)q + (1/2)p = (1/2)(q + p) = 1/2$.

6. If the dominant and recessive allele frequencies are p and q, then $D = p^2$,
$H = 2pq$, and $R = q^2$, so $DR = p^2q^2$ and $H^2 = (2pq)^2 = 4p^2q^2$, confirming $DR = H^2/4$.

7. The frequency of AA = $p^2 = (1 - q)^2 = 1 - 2q + q^2 \approx 1 - 2q$; frequency of Aa
heterozygotes = $2pq = 2(1 - q)q = 2q - 2q^2 \approx 2q$; and frequency of aa = $q^2 \approx 0$.

8. The probability that an individual with the dominant phenotype is
heterozygous is $2pq/(p^2 + 2pq) = 2q/(1 + q)$, and among these individuals, the
recessive allele frequency is 1/2; hence, the frequency of recessive allele
among dominant phenotypes is $[2q/(1 + q)](1/2) = q/(1 + q)$.

9. Allele frequency q = 0.05. Expected frequency of homozygous (color
blind) females = $(0.05)^2 = 0.0025$ (about 1/400); expected frequency of
carriers = 2(0.95)(0.05) = 0.095 (about 1/10).

10. Allele frequency q = frequency of affected males. Frequency of carrier
females = $2pq = 2(1 - q)q = 2q - 2q^2 \approx 2q$ when $q^2 \approx 0$. Therefore, the
frequency of carrier females is approximately two times the frequency of
affected males.

11. Genotype frequencies given by expansion of $(pA + qa)^4$ instead of
$(pA + qa)^2$.

12. By the definition in the text, a polymorphic gene is one for which the
most common allele has a frequency less than 0.95, making genes 1, 2,
and 5 polymorphic in this sample, and giving P = 3/5 = 60 percent.
Heterozygosity with random mating equals $1 - \sum p_i^2$, which for genes 1-5
equals 0.47, 0.11, 0.01, 0, and 0.37, with an average of H = 0.19.

13. Normal allele is dominant because of phenotype in $F_1$. The $\chi^2$ value
equals $(88 - 93.75)^2/93.75 + (37 - 31.25)^2/31.25 = 1.41$ with one degree of
freedom (since no parameters were estimated from the data). The
associated probability level is approximately 0.25, so the hypothesis of a 3:1
ratio cannot be rejected.

14. Expected ratio obtained from the expansion of $[(3/4)D + (1/4)R]^3$, where D
and R represent the dominant and recessive phenotypes, respectively,
yielding the ratio 27 : 9 : 9 : 9 : 3 : 3 : 3 : 1 for each 64 progeny.
Expectations are 269.6 : 89.9 : 89.9 : 89.9 : 30.0 : 30.0 : 30.0 : 10.0 among 639
progeny, for a $\chi^2$ of 2.67 with seven degrees of freedom (because there are
eight classes of data and no parameters estimated), for which the
associated probability value is about 0.92. This is a very good fit, indeed.
15. $A_1B_1$: (0.3)(0.2) = 0.06;
$A_1B_2$: (0.3)(0.3) = 0.09;
$A_1B_3$: (0.3)(0.5) = 0.15;
$A_2B_1$: (0.7)(0.2) = 0.14;
$A_2B_2$: (0.7)(0.3) = 0.21;
$A_2B_3$: (0.7)(0.5) = 0.35.

16. (a) Assuming linkage equilibrium, frequencies of gametes $A_1B_1$, $A_1B_2$,
$A_2B_1$, and $A_2B_2$ are $p_1q_1$, $p_1q_2$, $p_2q_1$, and $p_2q_2$, or (0.7)(0.3) = 0.21,
(0.7)(0.7) = 0.49, (0.3)(0.3) = 0.09, (0.3)(0.7) = 0.21, respectively. (b)
Theoretical maximum of D equals the smaller of $p_1q_2$ (0.49) and $p_2q_1$ (0.09),
or 0.09, and 50 percent of 0.09 gives D = 0.045. Gametic frequencies, given in
the same order as in (a), are 0.21 + 0.045 = 0.255, 0.49 - 0.045 = 0.445,
0.09 - 0.045 = 0.045, and 0.21 + 0.045 = 0.255.
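
The calculation in Answer 16 can be sketched as below. The variable names are illustrative; the allele frequencies are those used in the answer, and D is taken as positive, so its ceiling is the smaller of $p_1q_2$ and $p_2q_1$.

```python
p1, q1 = 0.7, 0.3          # frequencies of A1 and B1
p2, q2 = 1 - p1, 1 - q1    # frequencies of A2 and B2

d_max = min(p1 * q2, p2 * q1)  # largest possible positive D
d = 0.5 * d_max                # 50 percent of the theoretical maximum

# Coupling gametes gain D, repulsion gametes lose D.
gametes = {
    "A1B1": p1 * q1 + d,
    "A1B2": p1 * q2 - d,
    "A2B1": p2 * q1 - d,
    "A2B2": p2 * q2 + d,
}

for name, freq in gametes.items():
    print(f"{name}: {freq:.3f}")
```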

17. When dominant × dominant matings occur at random, the proportion of
homozygous recessive offspring equals the square of the recessive allele
frequency among individuals with the dominant phenotype, or $[q/(1 + q)]^2$,
because random mating of individuals is equivalent to random union of
gametes, just as in the derivation of the Hardy-Weinberg principle. With
dominant × recessive matings, the probability of a recessive gamete from
the parent with the dominant phenotype is $q/(1 + q)$, and from the parent
with the recessive phenotype it is 1; altogether the probability of a
homozygous recessive offspring from a dominant × recessive mating is
$[q/(1 + q)](1) = q/(1 + q)$, and likewise from a recessive × dominant mating.

CHAPTER  4 

1. Average frequency before fusion = $[(q + e)^2 + (q - e)^2]/2 = q^2 + e^2$; after
fusion = $q^2$; difference = $e^2$ = variance in allele frequency among
subpopulations.

2. Multiply out $(1 - F_{IS})(1 - F_{ST}) = 1 - F_{IT}$ and cancel the 1s, giving
$F_{IT} = F_{IS} + F_{ST} - F_{IS}F_{ST}$. The expression says
that the probability of autozygosity within the total population equals
the probability of autozygosity because of inbreeding within a
subpopulation, plus the probability of autozygosity because of random genetic
drift, minus the probability of autozygosity for both reasons.

3. Heterozygosities are 0.54, 0.62, 0.66, respectively, average 0.61. Fused
population has allele frequencies 0.2, 0.3, 0.5, and heterozygosity 0.62.
$F_{ST} = 0.02$. Maximum $F_{ST}$ occurs when each population is fixed for a
different allele, and $F_{ST} = 1$ for the set of three as well as for each pairwise
comparison.

4. Allele frequency is 0.2 in both subpopulations, giving $H_S = 0.32$. Average
$H_I = 0.272$. $F_{IS} = 0.15$. (Note from the genotype frequencies that the
populations have inbreeding coefficients 0.1 and 0.2, respectively, which
average 0.15). Since the allele frequencies are identical, $H_T = 0.32$ also, so
$F_{ST} = 0$. Since $(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$, $F_{IT} = 0.15$.

5. Gametic frequencies are $p_1q_1 + e$, $p_1q_2 - e$, $p_2q_1 - e$, $p_2q_2 + e$ in one
population, $p_1q_1 - e$, $p_1q_2 + e$, $p_2q_1 + e$, $p_2q_2 - e$ in the other; D in fused
population equals 0.

6. $p^2 - p^2F + pF = p^2 + p(1 - p)F = p^2 + pqF = p(1 - q) + pqF = p - pq + pqF =
p - pq(1 - F)$.

7. Random mating frequencies 1/4, 1/2, 1/4; among offspring of first cousins
(F = 1/16), they are 17/64, 30/64, 17/64. Deficiency of heterozygotes =
(32/64 - 30/64)/(32/64) = 1/16 = F.

8. $q = (1/1600)^{1/2} = 1/40$, and expected frequency from first-cousin matings
is (1/1600)(15/16) + (1/40)(1/16) = 1/465.

9.  Proportions  with  first-cousin  parents  are  0.016,  0.022,  0.068,  0.120,  and 
0.391  for  the  frequencies  given.  When  q = 1,  the  proportion  is  0.01, 
which  says  that  one  percent  of  the  matings  are  between  first  cousins. 

10. With inbreeding F, mean allele frequency equals $[p^2(1 - F) + pF](1) +
[2pq(1 - F)](1/2) = p$. Mean square $MS = [p^2(1 - F) + pF](1)^2 +
[2pq(1 - F)](1/2)^2$. Variance equals $MS - p^2 = (pq/2)(1 + F)$. When F = 0 (random
mating), variance = $pq/2$. When F = 1 (complete inbreeding), variance =
$pq$.

11. $H_1 = 0.62$, $H_2 = 0.48$, $H_S = 0.55$, $H_T = 0.615$, and $F_{ST} = 0.106$.
$J_S = [(0.04 + 0.36)/2 + (0.09 + 0)/2 + (0.25 + 0.16)/2] = 0.45$,
$J_T = [(0.4)^2 + (0.15)^2 + (0.45)^2] = 0.385$, and $G_{ST} = 0.106$. Note that
$J_S = 1 - H_S$; $J_T = 1 - H_T$; so $G_{ST} = F_{ST}$.
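
The heterozygosity route to $F_{ST}$ in Answer 11 can be sketched as follows. The two allele-frequency vectors are inferred from the squared terms shown in the $J_S$ calculation (0.04, 0.09, 0.25 and 0.36, 0, 0.16), so treat them as an assumption; equal subpopulation weights are also assumed.

```python
def heterozygosity(freqs):
    """Expected heterozygosity, one minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

# Allele frequencies in the two subpopulations (inferred, see lead-in).
subpops = [
    [0.2, 0.3, 0.5],
    [0.6, 0.0, 0.4],
]

h_s = sum(heterozygosity(f) for f in subpops) / len(subpops)   # mean within-subpopulation H
mean_freqs = [sum(col) / len(subpops) for col in zip(*subpops)]
h_t = heterozygosity(mean_freqs)                               # total-population H
f_st = (h_t - h_s) / h_t

print(f"H_S = {h_s:.3f}, H_T = {h_t:.3f}, F_ST = {f_st:.3f}")
```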

12. $F_{ST(1)} = (0.48 - 0.40)/0.48 = 0.167$; $F_{ST(2)} = (0.255 - 0.21)/0.255 = 0.176$;
$F_{ST(3)} = (0.495 - 0.49)/0.495 = 0.010$. Average frequencies are $p_1 = 0.4$,
$p_2 = 0.15$, $p_3 = 0.45$, and the weighted average $F_{ST} = 0.106$, which equals the
$G_{ST}$ calculated in the preceding problem.

13.  Because  a male  contributes  the  Y chromosome  to  his  sons,  paths  with 
two  or  more  consecutive  males  have  probability  0 for  the  transmission  of 
an  X-linked  gene. 

14. Equilibrium F = 0.2/(2 - 0.2) = 1/9. Genotype frequencies are (1/9)(8/9) +
(1/3)(1/9), (4/9)(8/9), (4/9)(8/9) + (2/3)(1/9), or 0.135, 0.395, 0.469.

15. $F_A = F_B = 0$; $F_C = F_D = 0$; $F_E = F_F = 2(1/2)^3 = 1/4$;
$F_G = F_H = 2(1/2)^3 + 4(1/2)^5 = 3/8$;
$F_I = (1/2)^3(1 + F_E) + (1/2)^3(1 + F_F) + 4(1/2)^5 + 8(1/2)^7 = 8/16$. Note that the
pedigree is of three generations of sib mating.

16.  F = 1/2  in  odd-numbered  generations  (the  progeny  of  selfing),  and  F = 0 
in  even-numbered  generations  (the  progeny  of  random  mating). 

17. $F_0 = 0$, $F_1 = 0$, $F_2 = 1/4$, $F_3 = 3/8$, $F_4 = 8/16$, $F_5 = 19/32 = 0.59$, so genotype
frequencies after five generations are 0.21, 0.17, 0.61; one additional
generation of random mating restores the Hardy-Weinberg frequencies of
0.09, 0.42, 0.49.

18. (a) $F_{ST} = 1 - (1 - 1/100)^{50} = 0.39$; $F_{IS} = 1/4$; $F_{IT} = 1 - (0.75)(0.61) = 0.54$. (The
exponent in $F_{ST}$ is 50, not 47, because random drift still occurs during
the generations of sib mating.) (b) $F_{ST} = 0.39$; $F_{IS} = 0$; $F_{IT} = 0.39$.

19. $A_1A_2$, $A_2A_3$, $A_1A_3$ styles occur in equal frequency at equilibrium, so the
probability of any pollen grain landing on a compatible style is 1/3.

20. Two-way hybrid has genotype $A_1A_2$ and gametes $1/2\,A_1 + 1/2\,A_2$, so the
probability of identity by descent (F) among offspring is $(1/2)^2 + (1/2)^2 =
1/2$; three-way hybrid has gametes $1/4\,A_1$, $1/4\,A_2$, $1/2\,A_3$, and
$F = (1/4)^2 + (1/4)^2 + (1/2)^2 = 3/8$; four-way hybrid has gametes $1/4\,A_1$,
$1/4\,A_2$, $1/4\,A_3$, $1/4\,A_4$, and $F = 4(1/4)^2 = 1/4$.

21. $F_t = (1/2)^2(1 + F_{t-2}) + (1/2)^3(1 + F_{t-3}) + (1/2)^4(1 + F_{t-4}) + \ldots =
(1/2)^2(1 + F_{t-2}) + (1/2)F_{t-1} = 1/4 + (1/2)F_{t-1} + (1/4)F_{t-2}$. Therefore,
$F_0 = 0$, $F_1 = 0$, $F_2 = 1/4$, $F_3 = 3/8$, $F_4 = 8/16$, $F_5 = 19/32$.

22. $F_t = (1/2)^2(1) + (1/2)^3(1) + (1/2)^4(1) + \ldots = (1/2)^2 + (1/2)F_{t-1}$; therefore,
$F_0 = 0$, $F_1 = 1/4$, $F_2 = 3/8$, $F_3 = 7/16$, $F_4 = 15/32$, $F_5 = 31/64$. The
equilibrium is given by $F = (1/4) + F/2$, or $F = 1/2$.
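
The recurrence in Answer 22, $F_t = 1/4 + F_{t-1}/2$, is easy to iterate with exact fractions. The function name is an illustrative choice, not from the text.

```python
from fractions import Fraction

def inbreeding_series(generations):
    """Iterate F_t = 1/4 + F_{t-1}/2 starting from F_0 = 0."""
    series = [Fraction(0)]
    for _ in range(generations):
        series.append(Fraction(1, 4) + series[-1] / 2)
    return series

series = inbreeding_series(5)
print(", ".join(str(f) for f in series))

# The fixed point of F = 1/4 + F/2 is F = 1/2; the series approaches it from below.
gap_to_equilibrium = Fraction(1, 2) - series[-1]
print(f"distance from equilibrium after 5 generations: {gap_to_equilibrium}")
```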


CHAPTER  5 

1.  Various  kinds  of  mutations  anywhere  along  the  gene  can  result  in  loss  of 
gene  function;  once  a gene  has  mutated,  only  very  specific  kinds  of 
reverse  mutations  will  restore  function. 

2.  The  experiment  proved  the  point  because  the  antibiotic-resistant  cells 
in  the  colonies  on  the  unselective  plate  were  never  themselves  exposed 
to  the  antibiotic. 

3. $P_0 = 11/20 = 0.55$ and $m = -[\ln(P_0)]/N = 1.1 \times 10^{-9}$ per generation.

4. Use the Poisson zero term $P_0 = \exp(-m)$ with $P_0 = 1 - 0.35 = 0.65$, giving
$m = -\ln P_0 = 0.43$ as the average number of lethals per chromosome.

5. Dominant lethals $(2 \text{ to } 10 \times 10^{-2})/(5 \times 10^{-4})$ = 40 to 200 rads; recessive
visibles $(8 \times 10^{-6})/(7 \times 10^{-8})$ = 114 rads; reciprocal translocations
$(2 \text{ to } 5 \times 10^{-4})/(1 \text{ to } 2 \times 10^{-5})$ = 10-50 rads. (Incidentally, humans appear to be
somewhat less radiation sensitive than mice.)

6. With $\nu = 0$, $p_t = p_0(1 - \mu)^t$. For 10, 100, 1000, and 10,000 generations, $p_t$
equals 0.99995, 0.9995, 0.995, and 0.95, respectively. Note that the
approximation $p_t \approx 1 - \mu t$ is very accurate in this case.

7. Use $p_t = p_0(1 - \mu)^t$ with $p_0 = 1$, $\mu = 0.01$, and $p_t = 0.90$. Then $t =
\ln(0.90)/\ln(0.99) = 10.5$ generations.

8. Use $q_t = q_0 + \mu t$. (a) For t = 0 to 12 generations, each four-generation interval
increases q by $2 \times 10^{-6}$, so $\mu = (2 \times 10^{-6})/4 = 5 \times 10^{-7}$ per generation. (b) For
t = 12 to 24, each interval increases q by $0.04 \times 10^{-6}$, so $\mu = (0.04 \times 10^{-6})/4 =
10^{-8}$ per generation. The novel metabolite decreases the mutation rate;
such substances are called antimutagens.

9. (a) Equilibrium $\hat{p} = \nu/(\mu + \nu) = 1/11$; (b) 1/101; (c) 1/2; (d) 1/11.

10. The equation for reversible mutation implies that $(p_t - \hat{p})/(p_0 - \hat{p}) =
(1 - \mu - \nu)^t$, and halfway to equilibrium means that $p_t - \hat{p} = (1/2)(p_0 - \hat{p})$, so
$(1 - \mu - \nu)^t$ equals 1/2. Therefore $t \ln(1 - \mu - \nu) = \ln(1/2)$ or, using the
approximation, $t = \ln 2/(\mu + \nu) = 0.7/(\mu + \nu)$. For the values of $\mu$ and $\nu$ specified,
$t = 6.4 \times 10^4$ generations.

11. $q_1 = q_0 + \mu_0$; $q_2 = q_1 + \mu_1 = q_0 + \mu_0 + \mu_1$, and so on to obtain $q_t = q_0 + \sum \mu_i$.
Can write this as $q_0 + \mu t$ if $\mu$ is interpreted as the arithmetic mean of the
mutation rates $(\sum \mu_i)/t$.

12. The effective number of alleles equals the reciprocal of homozygosity
(sums of squares of allele frequencies). These examples have $n_e$ = 2.8
and 8.

13.  Rare  alleles  contribute  negligibly  to  the  homozygosity;  hence  the  effec- 
tive number  of  alleles  is  essentially  independent  of  the  number  of  rare 
ones. 

14. Use the relation $F = 1/[4N(m + \mu) + 1]$ with $N = 50$, $\mu = 10^{-5}$ and $m = 10^{-3}$
to obtain F = 0.83, so H = 1 - F = 0.17.

15. The expected number of neutral alleles in a sample of size two equals 1
+ x when $\theta = x/(1 - x)$. Since $H = \theta/(1 + \theta)$, the heterozygosity is H = x.

16. Use $F_t = 1 - [1 - 1/(2N)]^t$ with N = 50 and t = 200, so $F_{200} = 0.866$.
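
A quick numerical check of the drift formula in Answer 16, $F_t = 1 - [1 - 1/(2N)]^t$; the function name is an illustrative choice.

```python
def inbreeding_after_drift(n, t):
    """F_t = 1 - (1 - 1/(2N))**t for an ideal population of size N, starting at F_0 = 0."""
    return 1.0 - (1.0 - 1.0 / (2 * n)) ** t

f_200 = inbreeding_after_drift(n=50, t=200)
print(f"F after 200 generations at N = 50: {f_200:.3f}")
```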

17. Use $F_t = 1 - [1 - 1/(2N)]^t \approx 1 - \exp(-t/2N)$ with $F_t = 1/4$. Thus
$t = -2N \ln(3/4) = 0.6N$ generations.

18. Use the equation $p_t = \bar{p} + (p_0 - \bar{p})(1 - m)^t$ with $p_t = 0.5$, $p_0 = 0.2$, $\bar{p} = 0.8$
and m = 0.01. Then t = 69 generations.
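
Solving the one-way migration equation of Answer 18 for t gives $t = \ln[(p_t - \bar{p})/(p_0 - \bar{p})]/\ln(1 - m)$. A small sketch, with an illustrative function name:

```python
import math

def generations_of_migration(p_t, p_0, p_bar, m):
    """Generations needed for p to move from p_0 to p_t under migration rate m."""
    remaining = (p_t - p_bar) / (p_0 - p_bar)  # fraction of the initial gap still left
    return math.log(remaining) / math.log(1.0 - m)

t = generations_of_migration(p_t=0.5, p_0=0.2, p_bar=0.8, m=0.01)
print(f"t = {t:.1f} generations")
```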

19. $(1 - m)^{10} = 0.6$, so allele frequencies are 0.32, 0.44, 0.56, and 0.68.

20. Using $p_t - \bar{p} = (p_0 - \bar{p})(1 - m)^t$, note that $\sum(p_t - \bar{p})^2 = \sum(p_0 - \bar{p})^2(1 - m)^{2t}$, or
$s_t^2 = s_0^2(1 - m)^{2t}$.

21. Use $F = 1/(4Nm + 1)$ with $1/(4Nm + 1) < 0.05$, so $m > 4.75/N$.

CHAPTER  6 

1-  Pi  = Po/ (1  - <?oSo);  = qoSo/  (1  - ^o);  and  so  on  give  {pi/q{)  = (po/q0)s0; 

$p_2/q_2 = (p_0/q_0)(s_0s_1)$; ...; $p_n/q_n = (p_0/q_0)(s_0s_1 \cdots s_{n-1})$. To use the formula
with s, set $s = (s_0s_1 \cdots s_{n-1})^{1/n}$, so s is the geometric mean of the selection
coefficients.

2. Use the general equation for two-allele viability selection $p' = (p^2w_{11} +
pqw_{12})/\bar{w}$, to get $p_1 = 0.736$, $p_2 = 0.768$, and $p_3 = 0.796$.
3. Defining the fitnesses as 1 - s : 1 : 1 - t, the equilibrium allele frequency
is $p = t/(s + t)$, where (a) s = 0.7, t = 0.3; (b) s = 0.07, t = 0.03; (c) s = 0.007,
t = 0.003. In all cases p = 0.3.

4. $\bar{w} = (0.64)(0.9) + 2(0.8)(0.2)(1) + (0.04)(0.6) = 0.92$. Note that 0.8 = 0.4/(0.4 +
0.1), so 0.8 is the equilibrium frequency for this case of overdominance,
and in this model the equilibrium occurs at maximum $\bar{w}$.

5. $q' = pq(1 - h)/[p^2 + 2pq(1 - h)] = q(1 - h)/(1 + q - 2qh) \approx q(1 - h)$ with $q'/q =
0.99$ given, so h = 0.01.

6. t = 4[-2.2 + 11.5] = 37.2 generations. Note that the equation can be
written as $t = (2/s)[\ln(p_t/p_0) - \ln(q_t/q_0)]$, and with $p_t/p_0 \gg 1$, $\ln(q_t/q_0) \approx 0$.
For a large range of values of $p_t/p_0$ and s, t ranges from 5 to 50. This
explains in part why significant pesticide resistance evolves so rapidly in
diverse species.

7. For the haploid, $p' = p/(1 - qs)$. For the diploid, $p' = [p^2 + pq(1 - s)]/[p^2 + 2pq(1
- s) + q^2(1 - s)^2] = p[p + q(1 - s)]/[p + q(1 - s)]^2 = p/[p + q(1 - s)] = p/(1 - qs)$.

8. Interchange p and q and change the sign of s to obtain $\ln(q_t/p_t) + 1/p_t =
[\ln(q_0/p_0) + 1/p_0] - st$, or $\ln(p_t/q_t) - 1/p_t = [\ln(p_0/q_0) - 1/p_0] + st$, where
$p_t$ is now the allele frequency of the favored recessive and s is the
selection coefficient against individuals with the dominant allele.

9. $d\Delta p/dp = (1/2 - p)(1 - p) - p(1/2 - p) - p(1 - p)$. This value is positive for p =
0 and p = 1 (indicating unstable equilibria) and negative for p = 1/2
(indicating local stability). The latter point is also globally stable.

10. For a recessive lethal, $w_{11} = w_{12} = 1$, $w_{22} = 0$. Then $q' = pq/(p^2 + 2pq) =
q/(1 + q)$, or $1/q_n = 1 + 1/q_{n-1}$, which implies $1/q_n = n + 1/q_0$, or $q_n =
q_0/(1 + nq_0)$. For $q_n = q_0/2$, $n = 1/q_0$ generations.
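
The halving time for a recessive lethal can be checked by iterating $q' = q/(1 + q)$ with exact fractions. The starting frequency of 1/10 is an arbitrary illustrative value, not one taken from the problem.

```python
from fractions import Fraction

def next_q(q):
    """One generation of selection against a recessive lethal: q' = q / (1 + q)."""
    return q / (1 + q)

q0 = Fraction(1, 10)  # illustrative starting frequency (an assumption)
q = q0
generations = 0
while q > q0 / 2:
    q = next_q(q)
    generations += 1

print(f"q fell from {q0} to {q} in {generations} generations (1/q0 = {1 / q0})")
```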

11. Into the equation $q = \mu/hs$ substitute $hs = 1/2$, hence $q = 18 \times 10^{-5}$; the
expected frequency of affected individuals is $2pq = 36 \times 10^{-5}$.

12. From the equation $q = \sqrt{\mu/s} = \sqrt{4 \times 10^{-6}/0.2} = 4.5 \times 10^{-3}$; also, $q = 4 \times
10^{-6}/[(0.05)(0.2)] = 4.0 \times 10^{-4}$. The five percent heterozygous effect reduces
the equilibrium frequency by more than an order of magnitude.
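
The two equilibrium formulas compared in Answer 12 can be evaluated side by side. Variable names are illustrative; the parameter values are those used in the answer.

```python
import math

mu = 4e-6   # mutation rate
s = 0.2     # selection against the homozygote
h = 0.05    # degree of dominance (heterozygous effect)

q_recessive = math.sqrt(mu / s)  # complete recessive: q = sqrt(mu / s)
q_partial = mu / (h * s)         # partial dominance:  q = mu / (h * s)
fold_reduction = q_recessive / q_partial

print(f"recessive: {q_recessive:.2e}, partial dominance: {q_partial:.2e}, "
      f"reduction: {fold_reduction:.1f}-fold")
```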

13. From the equation $q = \sqrt{\mu/s} = \sqrt{10^{-6}/0.6} = 1.3 \times 10^{-3}$; this would be
$1.0 \times 10^{-3}$ if homozygotes did not reproduce.

14. Frequency of heterozygotes = $2q$, frequency of heterozygotes from new
mutations = $2\mu$, ratio = $\mu/q$, and since $q = \mu/h$, the desired ratio is h.

15. The given condition is $(1/0.9) + (v_{12}/1) > 2$, or $v_{12} > 0.89$.

16. Substituting into the given expression for equilibrium, $w_{11} = 1$, $w_{12} = 1$,
$v_1 = 1 - s$, $v_2 = 1$, we get p = 1 - s is the frequency of A, so q = s is the
frequency of a.

17. Set p = 1 - 0.125 = 0.875 and k = 0.75. Then $w_{11}$ satisfies $-2(0.25)w_{11}/(1 -
2w_{11}) = 0.875$, or $w_{11} = 0.7$. Actually, segregation distorter is active only in
males, and a model that takes this into account yields $w_{11} = 0.79$ in
males and $w_{11} = 1$ in females (Hartl 1970).

18.  The  reason  is  that  homozygotes  for  some  alleles  may  be  superior  to  het- 
erozygotes containing  other  alleles. 

19. The equilibrium allele frequency for this overdominant case is: $p = (w_{12}
- w_{22})/(2w_{12} - w_{11} - w_{22}) = 0.375$. The mean fitness at this point is
obtained by substituting into $\bar{w} = p^2w_{11} + 2pqw_{12} + q^2w_{22} = 0.8125$. The
marginal fitness of the A" allele is 0.8, because all genotypes bearing this
allele have a fitness of 0.8. Because the marginal fitness of the A" allele is
less than the mean fitness, A" does not increase in frequency.

20. Fitnesses are $A_1A_2$, 0.7; $A_1A_3$, 0.6; $A_1A_4$, 0.5; $A_2A_3$, 0.5; $A_2A_4$, 0.4; $A_3A_4$, 0.3;
and $p_i = 1/4$; $\bar{w} = (\sum A_iA_i \text{ fitnesses} + 2\sum A_iA_j \text{ fitnesses})/16 = 0.5$.

CHAPTER  7 

1. The expected heterozygosity in a finite population decreases according
to $H_t = (1 - 1/2N)H_{t-1}$. In this case, we expect H to go from 0.5 to 0.495 in
one generation. Since H fell considerably below this, it would appear
that there was substantial inbreeding. A $\chi^2$ test could be done to
formally test the statistical significance.

2. Since H_t = H_0 e^(-t/2N), we can substitute and take logs to get
ln(0.05) = -t/[2(20)]. Solving, we get t = 59.9 generations. For a
population that is 10 times the size, the time to decrease to 5% of the
original heterozygosity is 10 times as long, or 599 generations.

3. For an autosomal gene, the 24 cats represent 48 copies of the gene, so
the probability of ultimate fixation is 1/48 and the probability of
ultimate loss is 47/48. For an X-linked gene, the 24 cats represent 36
copies of the gene, so the probability of ultimate fixation is 1/36 and
that of ultimate loss 35/36.

4. The equation ln(1/2) = -t/2N, from boxed problem 7.6, implies that
t = 1.39N generations will reduce the average heterozygosity by half.
Because the plant is an annual, one generation corresponds to one year,
so in this case t = 50, giving N = 36 as the effective population size.

5. Using Equation 7.12 with N = 20 and t = 8 yields 0.18. Since
self-fertilization cannot occur, use of N = 20.5 would be a bit more
accurate, but this gives essentially the same answer. The value
F_ST = 0.18 is not much less than the inbreeding coefficient of 0.25
resulting from one generation of brother-sister mating.

6. Dividing and taking logs of both sides of Equation 7.11 yields

t = [ln(1 - F_t) - ln(1 - F_0)]/ln[1 - (1/2N)]

From F = 0.01 to 0.02 requires t = 1 generation; from 0.05 to 0.10
requires 5.4 generations. Using the suggested approximations yields
-F_t = -t/(2N) - F_0, and when F_t = 2F_0, then t = 2NF_0. For F_0 = 0.01
or 0.05, the approximation gives t = 1 and 5 generations, respectively.
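
The arithmetic in answers 5 and 6 can be checked with a short sketch. It assumes the recursion 1 - F_t = (1 - 1/2N)^t (1 - F_0), which is what the log formula in answer 6 implies, and it assumes N = 50 for answer 6 (inferred from t = 2NF_0 = 1 at F_0 = 0.01; the problem statement is not reproduced here). The function names are illustrative only.

```python
import math


def inbreeding_after(t, n, f0=0.0):
    """F_t from 1 - F_t = (1 - 1/(2N))^t * (1 - F_0)."""
    return 1.0 - (1.0 - 1.0 / (2 * n)) ** t * (1.0 - f0)


def generations_between(f0, ft, n):
    """t = [ln(1 - F_t) - ln(1 - F_0)] / ln[1 - 1/(2N)]."""
    return (math.log(1 - ft) - math.log(1 - f0)) / math.log(1 - 1 / (2 * n))


# Answer 5: N = 20, t = 8, starting from F_0 = 0.
print(round(inbreeding_after(8, 20), 2))

# Answer 6: N = 50 is an assumption inferred from the approximation.
print(round(generations_between(0.01, 0.02, 50), 1))
print(round(generations_between(0.05, 0.10, 50), 1))
```

Under these assumptions the three printed values agree with the 0.18, 1, and 5.4 quoted in the answers.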

7. N_f = 200, and the number of breeding males equals N_f/5 = 40. In this
situation N_e = 4N_mN_f/(N_m + N_f) = 4(40)(200)/(40 + 200) = 133.

8. Here N_e = 4N_mN_f/(N_m + N_f). For 10 cows and one bull,
N_e = 4(1)(10)/(1 + 10) = 3.6; for 40 cows and one bull, N_e = 3.9; for
10 cows and two bulls, N_e = 6.7.
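
As a quick check of answers 7 and 8, the formula N_e = 4N_mN_f/(N_m + N_f) can be evaluated directly. This is a minimal sketch; the function name is illustrative.

```python
def effective_size(n_males, n_females):
    """Effective size with unequal sex ratio: 4*Nm*Nf / (Nm + Nf)."""
    return 4 * n_males * n_females / (n_males + n_females)


# Answer 7: 40 breeding males, 200 females.
print(round(effective_size(40, 200)))

# Answer 8: (bulls, cows) combinations.
for bulls, cows in [(1, 10), (1, 40), (2, 10)]:
    print(bulls, cows, round(effective_size(bulls, cows), 1))
```

The output shows why adding cows barely changes N_e when there is a single bull: the rarer sex dominates the result.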

9. The appropriate formula is Equation 7.25, which for N_f = 100 and
N_m = 10 yields 37.5 and for N_f = 10 and N_m = 100 yields 21.4.

10. In the F1 population, all restriction site differences are
heterozygous so that the initial RFLP allele frequencies are 1/2.
Equation 7.15 can be applied to determine the expected fraction of sites
that remain heterozygous. With N = 80 and H_0 = 1.00, for t = 10,
H_t = exp(-10/160) = 0.94, so 0.94 × 100 = 94 sites segregating. For
t = 50, the same approach gives 73 sites segregating.
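
The decay used in answer 10 can be sketched in a few lines, assuming the exponential form H_t = H_0 exp(-t/2N) that the answer applies. The function name and the factor of 100 sites are taken from the answer's own arithmetic; nothing else is assumed.

```python
import math


def expected_heterozygosity(t, n, h0=1.0):
    """H_t = H_0 * exp(-t / (2N)), the approximation used in answer 10."""
    return h0 * math.exp(-t / (2 * n))


# Answer 10: N = 80, H_0 = 1.00, 100 sites initially heterozygous.
for t in (10, 50):
    print(t, round(100 * expected_heterozygosity(t, 80)))
```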

11. Using Equation 7.38 with N_e = 50 and t = 100 gives
(1/100)(99/100)^100 = 0.0037.

12.  Applying  Equation  7.34,  the  expected  number  of  sites  segregating  in 
samples  of  size  10,  20,  and  50  are  29.3,  36.0,  and  45.0,  respectively. 

13. Use Equation 7.29 to get F = 1/(1 + 4Nμ) = 1/(1 + 10^6 × 10^-6) = 0.5.
Because H = 1 - F, the heterozygosity is also 0.5. For the second part,
use the equation in boxed problem 7.9, F = 1/[4N(μ + m) + 1], and
substitute to get 1/3 = 1/[10^6(10^-6 + m) + 1], so m = 10^-6. Very
little migration (one migrant every four generations) is needed to make
the change in H.

14. The distribution of expected coalescence times is given by Equation
7.39, and the mean is 4N/[k(k - 1)], where k is the sample size.
Substituting, we get 200/[k(k - 1)] = 10, so k = 5.

15. Use Equation 7.15 with H_t = H_0/x, which yields (1/x) = exp(-t/2N),
or ln(1/x) = -t/2N, which implies that t = 2N ln x. For x = 2, the value
is t = 1.39N, agreeing with the value stated in the text.

16. Substitute H_t = H_0/e into Equation 7.15 to obtain
exp(-1) = exp(-t/2N), whence t = 2N.

17. The variance in frequency caused by sampling each cell generation is
pq/N. Assuming the stem cells had 50% of each mtDNA type, this is 1/4000.
The 95% confidence interval in the allele frequency after one generation
of sampling is ± two standard deviations, or ± 0.0316. In 30 cell
generations, it is therefore not unlikely to drift down to the observed
frequency of 0.2. Diffusion methods or computer simulation could be used
to obtain the probability of the observed result.

CHAPTER 8

1. Apply the Jukes-Cantor formula (Equation 8.15) to get k = 0.0517 when
d = 0.05, and k = 2.03 when d = 0.7. Note that when there are few
differences, k and d are nearly equal. When the proportion of sites that
differ nears the limiting value of 0.75, there can be more than one
expected substitution per site. Inspection of Equation 8.16 shows that
the variance in the estimate of k increases as d increases.

2. The analog to Equation 8.15 if there were six different nucleotides is
k = -(5/6) ln(1 - 6d/5). Substitute d = 0.2 and get k = 0.229. Note that
for four-base DNA, application of Equation 8.15 gives k = 0.233. The
compensation for potentially missed back mutations is greater for
four-base than for six-base DNA.
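
The values in answers 1 and 2 can be reproduced with a small sketch of the Jukes-Cantor correction, written here for an arbitrary number of equally likely states so that the four-base and six-base cases share one function. It assumes Equation 8.15 is the standard form k = -(3/4) ln(1 - 4d/3); the function and parameter names are illustrative.

```python
import math


def jukes_cantor(d, n_states=4):
    """Corrected substitutions per site: k = -b * ln(1 - d/b), b = (n-1)/n.

    With n_states = 4 this is k = -(3/4) ln(1 - 4d/3); with n_states = 6
    it is the six-nucleotide analog k = -(5/6) ln(1 - 6d/5).
    """
    b = (n_states - 1) / n_states
    return -b * math.log(1 - d / b)


print(round(jukes_cantor(0.05), 4))   # answer 1, few differences
print(round(jukes_cantor(0.7), 2))    # answer 1, near saturation
print(round(jukes_cantor(0.2, 6), 3))  # answer 2, six-base case
print(round(jukes_cantor(0.2, 4), 3))  # answer 2, four-base case
```

The correction grows without bound as d approaches b, which is why estimates near saturation are so unreliable.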

3. If you assume that most differences are synonymous (third position),
then the reading frame begins with position 1 at the far left. The
sequences code for 20 amino acids, of which two are different. Altogether
there are 60 nucleotides (13 different), 40 nonsynonymous sites (two
different), 20 synonymous sites (11 different).

a. Use Equation 8.5 with D = 2/20, K = 0.105.

b. Use Equation 8.15 with d = 13/60, k = 0.256.

c. Use Equation 8.15 with d = 2/40, k = 0.052.

d. Use Equation 8.15 with d = 11/20, k = 0.991.

4. With d = 0.33 in Equation 8.15, k = 0.43, and t = 0.43/(2 × 0.01/yr) =
21.5 yr. The date of divergence is approximately 1983 - 21.5 = 1961 or
1962.

5. Values of k from Equation 8.15 are

          HIV2    VISNA   MMLV
HIV1      0.45    0.95    1.31
HIV2              0.89    1.37
VISNA                     1.37

Data indicate that HIV1 and HIV2 are most closely related to each other,
more distantly to VISNA, and still more distantly to MMLV. HIV1, HIV2,
and VISNA are all about equally distant from MMLV.

6. The selective constraints are probably weak, as 5 × 10^-9 approximates
the rate of synonymous substitution.

7. Most likely the protein is undergoing very rapid amino acid
replacement because of natural selection.

8. Compensatory substitutions might be expected, i.e., an A → G
substitution at a site matched with a T → C substitution at the position
with which it pairs. This is the pattern actually observed.

9. d = 3/4, which means that the nucleotide differences are as great as
expected in comparing two random sequences.

10. 0.005 × 10^-6 = 5 × 10^-9 substitutions along a lineage. The rate of
divergence between two lineages is twice this value, or 1% per million
years.

11.  The  proportion  of  synonymous  sites  that  substituted  is  34/310  = 0.11, 
and  the  proportion  of  nonsynonymous  sites  that  substituted  is  16/633  = 
0.025.  Using  these  as  the  observed  divergence  values  ( d ) in  Equation 
8.15,  the  estimates  of  the  number  of  substitutions  per  synonymous  and 
nonsynonymous  site  are  0.118  and  0.026,  respectively. 

12. (a) 3N/4; (b) N/4; (c) N/4.

13. The level of intrapopulation variation would differ between the two
scenarios. With a high rate of substitution, there would be more
intrapopulation variation, and less interpopulation differentiation than
in the case with low rates of substitution and migration. It is very
difficult to obtain reliable estimates of the neutral mutation rate and
population size because the models nearly always confound these factors.
Mutation and migration are similarly confounded in models of subdivided
populations. The neutral mutation rate can be estimated if one is willing
to compare two different species and date the divergence with the fossil
record.

14.  The  digits  on  the  branches  of  the  tree  indicate  the  site  position  along  the 
sequence  where  each  substitution  occurred: 

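[Figure: tree diagram with substitution sites marked on the branches.
Only the tip labels C, D, H, E and the site numbers 5 and 2 are
recoverable.]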


15. (a) Set 4N = 10 × 2 ln(2N) and iterate. Convergence is very rapid for
any starting value, and N = 17.9; (b) N = 323.6.
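
The iteration described in part (a) can be sketched as a fixed-point loop on N = 10 × 2 ln(2N)/4. The starting value, tolerance, and function name below are illustrative choices, not part of the original answer, and only part (a) is reproduced because the multiplier for part (b) is not stated here.

```python
import math


def solve_population_size(multiplier, start=10.0, tol=1e-9, max_iter=1000):
    """Solve 4N = multiplier * 2 ln(2N) by repeated substitution."""
    n = start
    for _ in range(max_iter):
        nxt = multiplier * 2 * math.log(2 * n) / 4
        if abs(nxt - n) < tol:
            return nxt
        n = nxt
    raise RuntimeError("fixed-point iteration did not converge")


# Part (a): multiplier 10.
print(round(solve_population_size(10), 1))
```

Near the solution each step shrinks the error by roughly a factor of 5/N, which is why only a handful of iterations are needed.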

16. Set f = c_2 and solve for λ = (n - 1)/4N. Then c_1 = c_2 =
1/(1 + 4Nμ) = 1 because 4Nμ << 1.

17. c_1 = c_2 = 1/n; f = 1.

18. When t = half life, e^(-ht) = 1/2, so t = -ln(1/2)/h.

19. Distribution of persistence times is a standard exponential
distribution with parameter h, P(t) = h exp(-ht), and the mean of an
exponential is the reciprocal of the exponential parameter.

CHAPTER  9 

1.  Heritability  is  a quantitative  measure  of  variance  partitioning.  It  says 
nothing  about  the  ability  of  a trait  to  be  altered  by  environmental 
effects.  High  heritability  does  not  mean  that  a trait  cannot  be  modified 
by  medical  intervention  or  education.  This  point  is  especially  important 
in  interpreting  heritability  of  abilities  and  behaviors  in  humans. 

2. Mean = 2206/30 = 73.5, s^2 = [(166,206/30) - (2206/30)^2](30/29) =
137.64, s = 11.7.

3. (a) The range 18-22 is within one standard deviation of the mean, so
proportion = 1 - 0.317 = 0.683. (b) > 22 is greater than one standard
deviation, so proportion = 0.317/2 = 0.158. (c) > 24, proportion =
0.046/2 = 0.023. (d) The range is 20-22, proportion = 0.5 - 0.158 =
0.342. (e) < 16, proportion = 0.046/2 = 0.023.

4. Environmental variance = 1.5, genotypic variance = 6.0 - 1.5 = 4.5,
and the broad-sense heritability = 4.5/6.0 = 75%.

5. Phenotypic standard deviation = 200 mg, so S = 400 mg; h^2 =
10,000/40,000 = 25%, so expected mean = 2000 + (0.25)(400) = 2100.

6. After one generation, μ' = 20 + 0.25(4) = 21; after 10 generations
(using Equation 9.38), μ' = 20 + 0.25(4)(10) = 30 bristles.

7. h^2 = Response/cumulative selection differential = 0.15/(0.07 × 5) =
0.43.
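
Answers 5 through 7 all rest on the relation R = h^2 S between response, heritability, and selection differential. A minimal sketch, with illustrative function names, reproduces the three results:

```python
def response(h2, s):
    """Response to one generation of selection: R = h^2 * S."""
    return h2 * s


def realized_heritability(total_response, cumulative_s):
    """h^2 estimated as response / cumulative selection differential."""
    return total_response / cumulative_s


# Answer 5: mean 2000 mg, S = 400 mg, h^2 = 0.25.
print(2000 + response(0.25, 400))

# Answer 6: ten generations with S = 4 bristles each.
print(20 + 10 * response(0.25, 4))

# Answer 7: response 0.15 after five generations with S = 0.07 each.
print(round(realized_heritability(0.15, 0.07 * 5), 2))
```

Multiplying the single-generation response by 10 in answer 6 assumes, as the answer does, that h^2 and S stay constant across generations.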

8. (a) a = (23.8 - 19.4)/2 = 2.2; d = [2(25.2) - 23.8 - 19.4]/2 = 3.6.
(b) Write values as if they were relative fitnesses, i.e., 1 - 0.056, 1,
1 - 0.230 for AA, AA', A'A'. Since the maximum of mean fitness occurs at
p = 0.230/(0.230 + 0.056) = 0.804, this value will maximize the mean of
the quantitative trait.

9. Using Equation 9.39, D = 3.73 + 2.01 = 5.74 and σ_g^2 = (0.87)^2 -
(0.60)^2 = 0.40, hence n = (5.74)^2/8(0.40) = 10.3.

10. Let X = first litter, Y = second litter. Then ΣX = 104, ΣX^2 = 1106,
ΣY = 103, ΣY^2 = 1101, ΣXY = 1089, and
r = [1089 - (104)(103)/10]/{[1106 - (104)^2/10][1101 - (103)^2/10]}^(1/2)
= 0.57.
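
The correlation in answer 10 is computed from summary sums rather than raw data, which a short sketch makes explicit. The function and argument names are illustrative; the numbers are those given in the answer, with n = 10 pairs.

```python
import math


def correlation_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Product-moment correlation from sums of X, X^2, Y, Y^2, and XY."""
    cov_term = sum_xy - sum_x * sum_y / n
    var_x_term = sum_x2 - sum_x ** 2 / n
    var_y_term = sum_y2 - sum_y ** 2 / n
    return cov_term / math.sqrt(var_x_term * var_y_term)


print(round(correlation_from_sums(10, 104, 1106, 103, 1101, 1089), 2))
```

The common divisor (n - 1) cancels between numerator and denominator, so the corrected sums of squares and products can be used directly.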

11. Response/h^2 = (220 - 180)/0.20 = 200 = Cumulative selection
differential. Number of generations = 200/20 = 10.

12. Selection intensity = i = [1/√(2π)] exp(-t^2/2)/B, where, taking the
top half, for example, B = 1/2 and t = 0.0, so i = 0.80. For the other
proportions (in order), i = 1.27, 1.63, 1.95, 2.26.
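
The formula in answer 12 can be evaluated with the standard library's normal distribution. This sketch assumes truncation selection on a standard normal trait, so that t is the point cutting off the top proportion B; the function name is illustrative. Values computed this way may differ in the second decimal from figures read off a printed table.

```python
from statistics import NormalDist


def selection_intensity(b):
    """i = z / B, where z is the standard normal density at the
    truncation point t that leaves a proportion B in the upper tail."""
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1 - b)
    return std_normal.pdf(t) / b


for b in (1 / 2, 1 / 4):
    print(b, round(selection_intensity(b), 2))
```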
13. Use subscripts M and F for males and females. S_M = m - m_M,
S_F = m - m_F. Mean of selected parents is (m_M + m_F)/2, so the proper
total selection differential is m - (m_M + m_F)/2 = (S_M + S_F)/2.

14. For A and B loci, 2pq[a + (q - p)d]^2 = 0.18 and 0.067, respectively,
so σ_a^2 = 0.247, and h^2 = σ_a^2/σ^2 = 0.247. For A and B, σ_d^2 =
(2pqd)^2 = 0.01 and 0, so broad sense heritability = 0.257, and 0.247,
respectively.

15. Equation 9.10 can be written as R = iσh^2, where i is the intensity
of selection. For percent protein, R = (1.5)(0.45)(0.7) = 0.47, so
expected percent protein = 3.3% + 0.47% = 3.8%. For correlated response
use Equation 9.45, CR = (1.5)(0.60)^(1/2)(0.70)^(1/2)(0.55)(0.65) = 0.35,
so expected percent fat = 3.4% + 0.35% = 3.75%. The fat increase
corresponds to a direct selection of i = 0.35/[(0.60)(0.65)] = 0.90.

16. Let the phenotypes of AA, AA', A'A' be 1, 0, 0 with frequencies p^2,
2pq, q^2, so A' is dominant with frequency q. Then a = 1/2, d = -1/2, and
σ_a^2 = 2p^3 q. Then the total variance is σ^2 = p^2 - (p^2)^2 =
p^2 q(1 + p). Therefore h^2 = 2p^3 q/[p^2 q(1 + p)] = 2(1 - q)/(2 - q),
which is approximately 1 - q when q is small.

17. Let fitnesses of AA, AA', A'A' be 1 - s, 1, 1 - t, so a = -(s - t)/2
and d = (s + t)/2. At equilibrium p = t/(s + t) and q = s/(s + t), so
a + (q - p)d = 0 and σ_a^2 = 0.

18. For B = 1/2, 1/4, 1/8, 1/16, 1/32 the approximation gives i = 0.80,
1.25, 1.60, 1.91, 2.21, respectively.

19. Z = [1/(σ√(2π))] exp[-(T - μ)^2/2σ^2] and z = [1/√(2π)] exp(-t^2/2) =
[1/√(2π)] exp[-(T - μ)^2/2σ^2]. Therefore, z = σZ.